Preface



These are the proceedings of CHES 2003, the fifth workshop on Cryptographic 
Hardware and Embedded Systems, held in Cologne on September 8-10, 2003. As 
with every previous workshop, there was a record number of submissions despite 
the much earlier deadline in this year’s call for papers. This is a clear indication 
of the growing international importance of the scope of the conference and the 
relevance of the subject material to both industry and academia. 

The increasing competition for presenting at the conference has led to many
excellent papers and a higher standard overall. From the 111 submissions, time
constraints meant that only 32 could be accepted. The program committee worked
very hard to select the best. However, at the end of the review process there
were a number of good papers which it would have liked to include but for
which, sadly, there was insufficient space. In addition to the accepted papers
appearing in this volume, there were three invited presentations from Hans
Dobbertin (Ruhr-Universität Bochum, Germany), Adi Shamir (Weizmann Institute,
Israel), and Frank Stajano (University of Cambridge, UK), and a panel
discussion on the effectiveness of current hardware and software countermeasures
against side channel leakage in embedded cryptosystems.

As always, the focus of the workshop is on practical aspects of cryptographic
hardware and embedded system security. A number of contributions pursue ideas
on the efficient use of resources (such as time, chip area, or power) within
constrained devices such as smart cards. These treat a wide range of
applications, including true random number generators, finite field and modular
arithmetic, and symmetric ciphers. Most of the remaining papers are concerned
with leakage of secret key data via side channels such as time, power, or
electromagnetic radiation, or through fault induction. Some of the contributions
show how to extract the secret key in particular circumstances; others present
more generic methodologies. These are complemented by other papers which provide
countermeasures for increased resistance against such attacks. Applications
include all the standard cryptosystems, both symmetric and public key, as well
as some less well known ciphers. Another point of interest is the extension to
hyperelliptic cryptosystems.

The CHES workshop series is now firmly established as an international
forum for intellectual exchange in creating the secure, reliable, and robust
security solutions which are required nowadays. CHES will continue to deal with
the pressing hardware and software implementation issues as more and more
systems and applications are developed which require encryption or
authentication.

We would like to thank Irmgard Kühn (Ruhr-Universität Bochum, Germany)
for her help with the local organization, Andre Weimerskirch (also from
Bochum) for his help again with the CHES website (www.chesworkshop.org),
and Gokay Saldamli and Colin van Dyke (both from Oregon State University)
for their help in preparing the proceedings.



The Security Challenges of Ubiquitous
Computing



Frank Stajano 

University of Cambridge 
http://www-lce.eng.cam.ac.uk/~fms27/

Ubiquitous computing, over a decade in the making, has finally graduated from
wacky buzzword through fashionable research topic to something that is
definitely and inevitably happening. This will mean revolutionary changes in the
way computing affects our society: changes of the same magnitude and scope
as those brought about by the World Wide Web.

The performance of a computer of given cost has gone up dramatically 
throughout the whole history of computing. Even just the last decade has 
brought improvements worth several orders of magnitude along such diverse 
dimensions as processor speed, memory capacity, disk capacity, communication 
bandwidth and so on. As we overtake the “a computer for everyone” milestone 
and march steadily towards a future in which each person owns hundreds of 
computing objects, we will start to explore a different region of the computer 
design space: keeping the performance constant and making the cost vanishingly 
small. Think of throw-away embedded computers inside shoes, drink cans and 
postage stamps. 

Security engineers will face specific technical challenges such as how to
provide the required cryptographic functionality within the smallest possible
gate count and the smallest possible power budget: the chips to be embedded in
postage stamps will be the size of a grain of sand and will be powered by the
energy radiated by an external scanning device.

The more significant security challenges, however, will be the systemic ones.
Ubiquitous computing is not just a wireless version of the Internet with a
thousand times more computers, and it would be a naive mistake to imagine that
the traditional security solutions for distributed systems will scale to the new
scenario. Authentication, authorization, and even concepts as fundamental as
ownership require thorough rethinking. The security challenges of the
architecture are much greater than those of the mechanisms.

At a higher level still, even goals and policies must be revised. Having
hundreds of computers per person changes the situation to such an extent that
even the most fundamental assumptions need reexamining. There are evident issues
of privacy, but also of trust and control. One question we should keep asking is
simply “Security for whom?” The owner of a device, for example, is no longer
necessarily the party whose interests the device will attempt to safeguard.

Ubiquitous computing is happening and will affect everyone. By itself it will 
never be “secure” (whatever this means) if not for the dedicated efforts of people 
like us. We are the ones who can make the difference. So, before focusing on the 
implementation details, let’s have a serious look at the big picture. 



C.D. Walter et al. (Eds.): CHES 2003, LNCS 2779, p. 1, 2003. 
© Springer-Verlag Berlin Heidelberg 2003




Multi-channel Attacks 



Dakshi Agrawal, Josyula R. Rao, and Pankaj Rohatgi

P.O. Box 704
Yorktown Heights, NY 10598

Abstract. We introduce multi-channel attacks, i.e., side-channel
attacks which utilize multiple side-channels such as power and EM
simultaneously. We propose an adversarial model which combines a
CMOS leakage model and the maximum-likelihood principle for
performing and analyzing such attacks. This model is essential for deriving
the optimal and very often counter-intuitive techniques for channel
selection and data analysis. We show that using multiple channels is
better for template attacks by experimentally showing a three-fold
reduction in the error probability. Developing sound countermeasures
against multi-channel attacks requires a rigorous leakage assessment
methodology. Under suitable assumptions and approximations, our
model also yields a practical assessment methodology for net
information leakage from the power and all available EM channels in
constrained devices such as chip-cards. Classical DPA/DEMA style
attacks assume an adversary weaker than that of our model. For this
adversary, we apply the maximum-likelihood principle to design
new and more efficient single- and multiple-channel DPA/DEMA attacks.

Keywords: Side-channel attacks, Power Analysis, EM Analysis, DPA, 
DEMA. 

1 Introduction 

1.1 The Problem 

Recent research in side-channel attacks has validated and reinforced the
observation that sensitive information can leak from cryptographic devices via a
multitude of channels. The seminal work of [9,8] describing leakages in timing
and power channels was followed by the work of [10,7,1] showing leakages via
electromagnetic (EM) emanations. The work of [1] shows that even a single EM
probe can yield multiple EM signals via demodulation of different carriers.
Further, different EM carriers carry different information and some EM leakages
exceed leakages in the power channel. All these channels provide a rich source
of information for a determined adversary.

While it seems plausible that side-channel attacks can be significantly
improved by capturing multiple side-channel signals such as the various EM
channels and possibly the power channel, a number of questions remain. Which
side-channel signals should be collected? How should information from various
channels be combined? How can one quantify the advantage of using multiple
channels? These issues are especially relevant to an attacker since a significant
equipment cost is associated with capturing each additional side-channel signal.
Furthermore, in some situations, the detection risk associated with the
additional equipment/probes required to capture a particular side channel has to
be weighed against the benefit provided by that channel.



1.2 Contributions 

To address these issues, we present a formal adversarial model for multi-channel
analysis using the power and various EM channels.¹ Our model is based on a
leakage model for CMOS devices and concepts from signal detection and
estimation theory. This formal model can be used to assess how an adversary
can best exploit the wide array of signals available to him. In theory, this model
can also deal with the problem of optimal channel selection and data analysis.
However, in practice, a straightforward application of this model can sometimes
be infeasible. We show a judicious choice of approximations that renders the
model useful for most practical applications.

Formulating such an adversarial model has numerous pitfalls. Ideally, the
model should capture the strongest possible multi-channel attacks on an
implementation of a cryptographic algorithm involving secret data. While such a
model is easy to define, using it to assess vulnerabilities and create attacks will
shift the focus from multi-channel information leakage to the specifics of the
algorithm and implementation.

To refocus the attention on information leakage from multiple side-channels,
we will only consider elementary leakages, i.e., information leaked during
elementary operations of CMOS devices. This allows us to deal with information
leakage aspects of multiple channels while not losing sight of the goal of
evaluating entire implementations. In fact, it can be shown that the leakage in an
entire computation is just the composition of elementary leakages from all of its
elementary operations [2].

We introduce an adversarial model that is based on this view of elementary 
leakages of CMOS devices and is phrased in terms of the maximum likelihood 
testing of hypotheses. The model provides a formal way of comparing efficacies of 
various signal selection and processing techniques that can be used by a resource 
limited adversary. 

Applying the model to the problem of signal selection, we find that the
optimal strategies for picking even the two best side-channels from a set of
possibilities can be complex and counter-intuitive. For instance, picking the two
channels with the best signal-to-noise ratios is quite often sub-optimal. The
model also shows how to best combine information from multiple channels. This
can be viewed as a generalization of template attacks [4] to the case of multiple
channels. We provide experimental evidence to show that multi-channel based
template attacks are superior to their single channel counterparts. Specifically,
for a smart card S,² we show that template attacks that use both an EM channel
and a power signal are superior to attacks that use only a single channel.

¹ Combining the timing and power channel is already known, e.g., [11].

Our model for multi-channel attacks is also valuable for the designers of
cryptographic implementations since they need to know the amount of leakage
from multiple sensors to select the appropriate level of countermeasures. We
describe a methodology for assessing any type of leakage in an
information-theoretic sense. The methodology permits the computation of bounds
on the best error probability achieved by an all-powerful adversary. While such
an assessment is impractical for arbitrary devices, it is feasible for the
practically important case of chipcards with small word lengths.

One drawback of our model is the assumption of a very powerful adversary who
has full knowledge of the characteristics of the target device and is capable
of performing attacks similar to template attacks on the device. In practice,
such attacks are tedious to mount and adversaries often do not have knowledge
about the device. Thus, DPA-style attacks continue to be important due to their
simplicity and immediate applicability to unknown implementations. Using
maximum likelihood testing as a basis, we show how current single channel
DPA-style attacks can be greatly improved and how multiple-channel DPA-style
attacks can be designed. The key to these improvements is a relaxation of the
maximum likelihood test which estimates the unknown parameters of the test
on the fly. We provide empirical evidence to show that a better analysis can
give a substantial reduction in the number of samples needed for a traditional
DPA attack, and an even better reduction factor when a multiple-channel DPA
attack is carried out using a power and an EM channel with very similar leakage
characteristics.

2 Adversarial Model 

This section develops an adversarial model to formally address issues related to 
the leakage of information via multiple side-channel signals. 

2.1 CMOS Side-Channel Elementary Leakages

In CMOS devices, all data processing is typically controlled by a “square-wave” 
shaped clock. Each clock edge triggers a short sequence of state changing events 
and corresponding currents in the data processing units. The events are transient 
and a steady state is achieved well before the next clock edge. At any clock cycle, 
all the events and resulting currents are determined by a comparatively small 
number of bits of the logic state of the device, i.e., one only needs to consider 
the state of active circuits during that clock cycle and not the entire state of the 
device. These bits, termed as relevant bits in [3], constitute the relevant state of 
the device. 

² A pseudonym is used to protect vendor identity; S is a 6805-based sub-micron,
double metal technology card with inbuilt noise generators.

Signals found on side-channels such as power and EM result from the
current flows within the device and are affected by the random thermal noise. As
mentioned above, ideally, the current flows in a CMOS device are directly
attributable to the relevant state of the device. However, in practice, there may
be many very small leakage currents in the inactive parts of the circuit. These
leakages can be approximated as a small Gaussian noise term having negligible
correlation with any particular active part of the circuit.

Thus as a very good approximation, all side-channel emanations during a 
clock cycle carry information only about the events and the relevant state of 
the device that occurs during the clock cycle. This is strongly supported by the 
experimental results which show that algorithmic bits are significantly correlated 
to the power/EM signals only during the clock cycles when they are actively 
involved in a computation. Thus it is natural to model side-channel leakage 
from the CMOS devices in terms of the leakages of the relevant state that occur 
during each clock cycle. We term the operation performed by the device during 
each clock cycle as an elementary operation and define the corresponding leakage 
of the relevant state information from side-channels as an elementary leakage. 



2.2 Adversarial Model for Elementary Leakages 

Given the concept of elementary leakages, it is natural to formulate side-channel
attacks in terms of how successful an adversary can be in obtaining information
about the relevant state. For example, an adversary may be interested in the
LSB of the data bus during a LOAD instruction. This has a natural formulation
as a binary hypothesis testing problem for the adversary.³ Such a formulation
also makes sense as traditionally binary hypothesis testing has been central
to the notions of side-channel attack resistance and leakage immunity [3,5].

The adversarial model consists of two phases. The first phase, known as
the profiling phase, is a training phase for the adversary. He is given a training
device identical to the target device, an elementary operation, two distinct
probability distributions $B_0$ and $B_1$ on the relevant states from which the
operation can be invoked, and a set of sensors for monitoring side-channel
signals. The adversary can invoke the elementary operation on the training
device, starting from any relevant state. It is expected that the adversary uses
this phase to prepare an attack.

In the second phase, known as the hypothesis testing phase, the adversary
is given the target device and the same set of sensors. He is allowed to make a
bounded number of invocations to the same elementary operation on the target
device starting from a relevant state that is drawn independently for each
invocation according to exactly one of the two distributions $B_0$ or $B_1$. The
choice of distribution is unknown to the adversary and his task is to use the
signals on the sensors to select the correct hypothesis ($H_0$ for $B_0$ and $H_1$
for $B_1$) about the distribution used. The utility of the side-channels can then
be measured in terms of the success probability achieved by the adversary as a
function of the number of invocations.

³ In general, the adversary faces an M-ary hypothesis testing problem on functions
of relevant state, for which results are straightforward generalizations of binary
hypothesis testing.



2.3 Sophisticated Attack Strategies 

Assume that an adversary acquires $L$ statistically independent sets of sensor
signals $O_i$, $i = 1, \ldots, L$. These $L$ sets of signals may correspond to $L$
invocations of an operation on the target device. Also assume that there are $K$
equally likely hypotheses $H_k$, $k = 1, \ldots, K$, on the origin of these signals.
Let $p(O \mid H)$ be the probability distribution of the sensor signals under the
hypothesis $H$. Under these assumptions, the maximum likelihood hypothesis test is
optimal and it decides in favor of the hypothesis $H_{\hat{k}}$ if

\[ \hat{k} = \arg\max_{1 \le k \le K} \prod_{i=1}^{L} p(O_i \mid H_k). \qquad (1) \]

While the maximum likelihood test is optimal, it is usually impractical as 
an exact characterization of the probability distribution of the sensor signals 
O may be infeasible. Such a characterization has to capture the nature of each 
of the sensor signals and the dependencies among them. This could further be 
complicated by the fact that, in addition to the thermal noise, the sensor signals 
could also display additional structure due to the interplay between properties 
of the device and those of the distributions of the relevant states. For example, 
if the hypothesis was on the LSB of a register while the device produced widely 
different signals only when the MSB was different, the sensor signals will display 
a bimodal effect attributable to the MSB. It turns out that in practice one 
can obtain near optimal results by making the right assumptions about the 
sensor signals. Such assumptions greatly simplify the task of hypothesis testing 
by requiring only a partial characterization of sensor signals. 
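
As an illustration of the decision rule in (1), the following minimal Python sketch picks the hypothesis with the largest summed log-likelihood over independent observations; maximizing the sum of logarithms is equivalent to maximizing the product. The scalar Gaussian likelihoods and the names `ml_decide` and `gaussian_loglik` are assumptions made for this example only; the model itself allows arbitrary distributions.

```python
import math


def gaussian_loglik(mean, var):
    """Return log p(o | H) for a scalar Gaussian (illustrative stand-in only)."""
    def loglik(o):
        return -0.5 * math.log(2 * math.pi * var) - (o - mean) ** 2 / (2 * var)
    return loglik


def ml_decide(observations, logliks):
    """Index k that maximizes sum_i log p(O_i | H_k) over all hypotheses."""
    scores = [sum(ll(o) for o in observations) for ll in logliks]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical hypotheses: mean 0 versus mean 1, unit variance.
hypotheses = [gaussian_loglik(0.0, 1.0), gaussian_loglik(1.0, 1.0)]
decision = ml_decide([0.9, 1.2, 0.4, 1.1], hypotheses)
```

Working in the log domain avoids numerical underflow when the number of observations $L$ is large.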



The Gaussian Assumption. One such widely applicable assumption is the
Gaussian assumption, which states that under the hypothesis $H$, the sensor signal
$O$ has a multivariate Gaussian distribution with mean $\mu_H$ and covariance
matrix $\Sigma_H$. A multivariate Gaussian distribution $p(\cdot \mid H)$ has the
following form:

\[ p(o \mid H) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_H|}} \exp\!\left(-\frac{1}{2}(o - \mu_H)^T \Sigma_H^{-1} (o - \mu_H)\right), \quad o \in \mathbb{R}^n, \qquad (2) \]

where $|\Sigma_H|$ denotes the determinant of $\Sigma_H$ and $\Sigma_H^{-1}$ denotes
the inverse of $\Sigma_H$. The Gaussian assumption holds for a large number of
devices and hypotheses encountered in practice. In fact this assumption has been
used successfully in the case of chip-cards [4].
It can be shown that under the Gaussian assumption, the maximum likelihood
hypothesis testing for a single observation $O$ and two equally likely
hypotheses $H_0$ and $H_1$⁴ simplifies to the following comparison:

\[ (O - \mu_{H_0})^T \Sigma_{H_0}^{-1} (O - \mu_{H_0}) - (O - \mu_{H_1})^T \Sigma_{H_1}^{-1} (O - \mu_{H_1}) \ge \ln|\Sigma_{H_1}| - \ln|\Sigma_{H_0}|, \qquad (3) \]

where a decision is made in favor of $H_1$ if the above comparison is true, and in
favor of $H_0$ otherwise.
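
In the scalar case the Gaussian likelihood comparison for two equally likely hypotheses reduces to comparing normalized squared distances against a log-variance term. The sketch below implements that scalar special case; the function name `decide_h1` and its parameter names are hypothetical, and the multivariate case would replace the divisions by inverse covariance matrices.

```python
import math


def decide_h1(o, mu0, var0, mu1, var1):
    """Scalar Gaussian likelihood comparison; True means decide H1.

    Compares (o - mu0)^2 / var0 - (o - mu1)^2 / var1 against
    ln(var1) - ln(var0), assuming equally likely hypotheses.
    """
    lhs = (o - mu0) ** 2 / var0 - (o - mu1) ** 2 / var1
    rhs = math.log(var1) - math.log(var0)
    return lhs >= rhs
```

With equal variances the rule becomes a simple threshold at the midpoint of the two means; with equal means and unequal variances it becomes a threshold on the magnitude of the observation.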

In many cases of practical interest, noise in the sensor signals does not depend
on the hypothesis, that is, $\Sigma_{H_0} = \Sigma_{H_1} = \Sigma_N$. In such cases,
the following well-known result from statistics gives the probability of error in
maximum-likelihood hypothesis testing [12]:

Fact 1 For equally likely binary hypotheses, the probability of error in the
maximum likelihood testing is given by

\[ P_\epsilon = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\Delta}{2\sqrt{2}}\right), \qquad (4) \]

where $\Delta^2 = (\mu_{H_1} - \mu_{H_0})^T \Sigma_N^{-1} (\mu_{H_1} - \mu_{H_0})$ and
$\mathrm{erfc}(x) = 1 - \mathrm{erf}(x)$. Note that $\Delta^2$ has a nice
interpretation as the optimal signal-to-noise ratio that an adversary
can achieve under the Gaussian assumption.
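
For two equally likely Gaussian hypotheses with a common noise covariance, the standard result is that the error probability equals one half of erfc(Δ / (2√2)), where Δ² is the squared distance between the means weighted by the inverse noise covariance. The sketch below evaluates this for the simplified case of uncorrelated noise (a diagonal covariance, which is an assumption of the example, as is the function name `ml_error_probability`).

```python
import math


def ml_error_probability(mu0, mu1, noise_var):
    """Error probability 0.5 * erfc(Delta / (2 * sqrt(2))).

    mu0, mu1: mean signals under the two hypotheses (equal-length lists).
    noise_var: per-sample noise variances (diagonal covariance assumed).
    """
    delta_sq = sum((b - a) ** 2 / v for a, b, v in zip(mu0, mu1, noise_var))
    return 0.5 * math.erfc(math.sqrt(delta_sq) / (2 * math.sqrt(2)))
```

When the two mean signals coincide, Δ is zero and the function returns 0.5, i.e., the adversary can do no better than guessing.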

In the rest of this section, we will present two applications of the theory 
discussed above. In the first application, a strategy for selecting multiple side 
channels is presented. In the second application, a template attack on multiple 
channels is devised. 

2.4 Multiple Channel Selection 

Consider a resource limited adversary who can select at most M channels for an 
attack. When viewed in terms of our model, this problem conceptually has a very 
simple solution: The adversary should choose those M channels that minimize 
his probability of error in the maximum likelihood testing. 

This apparently simple technique can be quite subtle and tricky in practice.
Clearly, in situations where a well-prepared adversary has nicely
characterized/approximated signals from each of the channels under each
hypothesis and the corresponding joint noise probability distribution between
all the channels, the adversary can also calculate the error probability for each
possible choice of M channels, at least for small M. For example, if the noise is
Gaussian and independent of the hypothesis, then from Equation 4, since
erfc(·) decreases exponentially with $\Delta$, the goal of an adversary limited to
just two channels would be to choose channels so as to maximize the output
signal-to-noise ratio $\Delta^2$.

⁴ Generalizations to multiple observations and more than two hypotheses are
straightforward.
If instead of a rigorous approach, channels are selected by heuristic
techniques, then the resulting selection could be sub-optimal for various subtle
reasons. Firstly, different side-channels could leak different aspects of
information relative to the hypotheses being tested and sometimes there could be
value in combining channels which provide widely dissimilar information rather
than combining those which provide similar but partial information. Secondly,
even if many channels provide the same information, picking multiple channels
from this set could still be valuable since that may be almost as good as having
the ability to make multiple invocations of the device with the same data and
collecting a single side-channel. Even for the case where only two side-channels
can be selected, the optimal choice is quite tricky and subtle as shown by the
example below where the naive approach of choosing the two signals with best
signal-to-noise ratios is shown to be sub-optimal.

Example 1. Consider the case where an adversary can collect two signals
$[O_1, O_2]^T$ at a single point in time, such that under the hypothesis $H_0$,
$O_k = N_k$ for $k = 1, 2$, and under the hypothesis $H_1$, $O_k = S_k + N_k$.
Assume that $N = (N_1, N_2)^T$ has a zero-mean multivariate Gaussian
distribution with

\[ \Sigma_N = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}. \]

Note that $O_1$ and $O_2$ have signal-to-noise ratios of $S_1^2$ and $S_2^2$
respectively. After some algebraic manipulations, we get

\[ \Delta^2 = \frac{(S_1 + S_2)^2}{2(1 + \rho)} + \frac{(S_1 - S_2)^2}{2(1 - \rho)}. \qquad (5) \]

Now, consider the case of an adversary who discovers two AM modulated carrier
frequencies which are close and carry compromising information, both of which
have very high and equally good signal-to-noise ratios ($S_1 = S_2$), and another
AM modulated carrier in a very different band with a lower signal-to-noise ratio.
An intuitive approach would be to pick the two carriers with high signal-to-noise
ratio. In this case $S_1 = S_2$ and we get $\Delta^2 = 2S_1^2/(1 + \rho)$. Since both
signals originate from carriers of similar frequencies, the noise that they carry
will have a high correlation coefficient $\rho$, which reduces $\Delta^2$ at the
output. On the other hand, if the adversary collects one signal from a good
carrier and the other from the worse quality carrier in the different band, then
the noise correlation is likely to be lower or even 0. In this case:

\[ \Delta^2 = \frac{(S_1 + S_2)^2}{2} + \frac{(S_1 - S_2)^2}{2} = S_1^2 + S_2^2 = S_1^2\,(1 + S_2^2/S_1^2). \qquad (6) \]

It is clear that the combination of a high and a low signal-to-noise ratio signal
would be a better strategy as long as $S_2^2/S_1^2 > (1 - \rho)/(1 + \rho)$. For
example, if $\rho > 1/3$, then choosing carriers from different frequency bands
with even half the signal-to-noise ratio results in better hypothesis testing.
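
The comparison in Example 1 can be checked numerically. The sketch below evaluates the output signal-to-noise ratio of two unit-variance channels with noise correlation ρ, using the closed form of Equation 5. The specific values (ρ = 0.5, a second carrier with half the signal-to-noise ratio) are illustrative assumptions, not measurements from the paper.

```python
def delta_squared(s1, s2, rho):
    """Output SNR of two unit-variance channels with noise correlation rho.

    Requires -1 < rho < 1 so that the covariance matrix is invertible.
    """
    return (s1 + s2) ** 2 / (2 * (1 + rho)) + (s1 - s2) ** 2 / (2 * (1 - rho))


s1 = 1.0
rho = 0.5  # assumed correlation between two carriers close in frequency

# Two equally strong carriers whose noise is correlated.
close_pair = delta_squared(s1, s1, rho)

# One strong carrier plus a carrier with half the SNR (amplitude s1/sqrt(2))
# taken from a different band, assumed here to have uncorrelated noise.
far_pair = delta_squared(s1, 0.5 ** 0.5 * s1, 0.0)
```

Here ρ = 0.5 exceeds 1/3, so the pair drawn from different bands yields the larger output signal-to-noise ratio despite the weaker second carrier.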

Based on the above analysis, in our experiments we routinely rejected a stronger
channel which was co-located with another collected channel and chose a channel
further away in the spectrum even if it had a lower signal-to-noise ratio.
2.5 Multi-channel Template Attacks

In [4], the maximum likelihood principle together with the Gaussian assumption
was shown to be very effective in classifying successive bytes of an RC4 key
using a single power side-channel signal. Expanding the template approach to
multiple channels is straightforward. In the template attack, at
any stage, the adversary uses an identical device to build exact characterizations
for the signal and noise for each of the K possibilities he has to classify. Then he
uses these characterizations to classify the one signal he is given from the
target device. The first step in the template approach is the identification of
those time instances (or indices of sample points) where the average signals for
each of the K possibilities differ significantly. The second step is to compute the
joint noise distribution of the channel at those points for each of the K
possibilities. The third step is to classify the given signal into the K
possibilities using maximum likelihood testing.

For multiple channels, the template attack is identical except that the signals
from the multiple channels are concatenated together to yield a larger signal, i.e.,
for each invocation, a combined signal is created by concatenating the signals
from the individual observed channels. Notice that the process of identifying the
time instances and sample points could end up selecting somewhat different time
slices for each channel, depending purely on the nature of leakage in each
channel. The maximum likelihood testing will pick up information from all
channels (possibly at different times) for classification.
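
The concatenation step can be sketched in a few lines of Python. This is a simplified illustration rather than the procedure used in the experiments: it assumes the points of interest have already been selected, it models the noise at each point as independent (a diagonal covariance instead of the joint noise distribution), and all function names and the toy profiling data are hypothetical.

```python
import math


def concatenate(channels):
    """Join the per-channel samples of one invocation into one combined signal."""
    return [x for channel in channels for x in channel]


def build_template(traces):
    """Per-point sample mean and variance (diagonal-noise simplification)."""
    n = len(traces)
    columns = list(zip(*traces))
    means = [sum(col) / n for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / (n - 1)
                 for col, m in zip(columns, means)]
    return means, variances


def log_likelihood(trace, template):
    """Gaussian log-likelihood of a combined trace under one template."""
    means, variances = template
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
               for x, m, v in zip(trace, means, variances))


def classify(trace, templates):
    """Index of the template with the highest log-likelihood."""
    scores = [log_likelihood(trace, t) for t in templates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy profiling data: two invocations per hypothesis, each with a
# 2-sample power trace and a 1-sample EM trace (made-up numbers).
profile_h0 = [concatenate([[0.0, 1.0], [2.0]]), concatenate([[2.0, 3.0], [4.0]])]
profile_h1 = [concatenate([[10.0, 11.0], [12.0]]), concatenate([[12.0, 13.0], [14.0]])]
templates = [build_template(profile_h0), build_template(profile_h1)]
guess = classify(concatenate([[11.5, 12.0], [13.5]]), templates)
```

Because the classifier only sees the combined signal, channels may contribute points from different time slices without any change to the classification code.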

To show that multiple channels help the classification process, we invoke an
operation on the smart card S with two different input bytes and look at just 3
cycles during which the input was first processed. We collected EM and power
samples simultaneously and evaluated how well the template attack could classify
a single EM/power trace into the two hypotheses $H_0$ and $H_1$ for the input byte.
We did this classification first using exactly one of the power/EM channels and
then performed the classification using both channels simultaneously. Figure 1
shows the mean EM and power signals for these hypotheses during these 3 cycles
side by side.⁵ Figure 2 shows the error rate of our classification effort for
inputs belonging to each hypothesis. One can clearly notice that using both
channels simultaneously results in better classification compared to any single
channel; for example, the error for $H_0$ drops from 9.5% (power only) to 2.8%.

3 Leakage Assessment Methodology for Chipcards 

The model developed in Section 2.2 can be used to derive a practical
methodology for assessing information leakage from any L power and EM channels
for simple CMOS devices such as 8-bit chipcards. Several key properties make such
a methodology feasible. Firstly, for a fixed relevant state, the noise at any cycle
is well-approximated by a Gaussian distribution. Thus, in the hypothesis
testing phase, the problem becomes one of distinguishing between two
distributions $B_0$ and $B_1$ which are mixtures of Gaussians. Hence, if the number
of relevant states (typically exponential in twice the word length) is small, each
Gaussian in the collection can be profiled and the success probability for
hypothesis testing can be computed. The problem of capturing leakages across
multiple bands in multiple channels can be practically solved by splitting each
channel into slightly overlapping bands up to a reasonable upper limit. Details of
this assessment methodology with such practical considerations are given in the
Appendix.

⁵ The slight offset in time is due to delay of EM signals with respect to the power
signal.

Fig. 1. Mean Power and EM signals during 3 cycles for two hypotheses

Correct Hypothesis   Error(Pwr)   Error(EM)   Error(EM+Pwr)
H0                   9.5%         15.1%       2.8%
H1                   20.1%        15.2%       6.6%

Fig. 2. Signal classification error using Power, EM and combination of Power and EM



4 Single and Multi-channel DPA Attacks 

In Section 2, we assumed that the adversary had access to a test device identical
to the target device and that he could carry out a profiling stage using the test
device. In many circumstances, access to a test device may not be possible. In
such cases, a DPA-style attack is preferred since it assumes no prior knowledge of
device characteristics or implementation. In this section, we apply tools from
detection theory to optimize existing single channel DPA attacks and propose
new multiple channel DPA attacks.



4.1 Improving DPA 

In the traditional DPA attack, an adversary collects a set of $N$ signals, $O_i$,
$i = 1, \ldots, N$, emanating from a given channel. Assume that the signals are
normalized to have zero sample average over all $N$ signals. For each hypothesis
$H$ under consideration, the $N$ signals are divided into two bins, termed the
0-bin and the 1-bin, with $N_{H,0}$ and $N_{H,1}$ samples respectively. Let
$\mu_{H,0}[j]$ and $\mu_{H,1}[j]$ be the sample means of signals in the 0-bin and
the 1-bin respectively for the hypothesis $H$. The next step in the DPA attack
consists of computing the differences of sample means
$\mu_{H,0}[j] - \mu_{H,1}[j]$ for all hypotheses, and deciding in favor of the
hypothesis $H_i$ if $\mu_{H_i,0}[j] - \mu_{H_i,1}[j]$ has the largest peak among all
differences of means. In other words, the decision metric for the hypothesis $H$
at time $j$ is given by

\[ M_H[j] = (\mu_{H,0}[j] - \mu_{H,1}[j])^2 \qquad (7) \]

and the decision is made in favor of the hypothesis $H_i$ if for some value of $j$,
say $j_0$, $M_{H_i}[j_0] \ge M_H[j]$ for all $H$ and $j$.
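
The difference-of-means decision described above can be sketched as follows. The squared difference is used as the per-time metric here as an assumption; any monotone function of the absolute difference leads to the same decision. The function names, the dictionary layout of `partitions`, and the requirement that both bins be non-empty are likewise choices made for this illustration.

```python
def dpa_metric(signals, selection_bits):
    """Squared difference of the bin means at each time index j.

    signals: list of equal-length traces.
    selection_bits: one 0/1 value per trace, as predicted by a hypothesis.
    Both bins must be non-empty.
    """
    bin0 = [s for s, b in zip(signals, selection_bits) if b == 0]
    bin1 = [s for s, b in zip(signals, selection_bits) if b == 1]
    mean0 = [sum(col) / len(bin0) for col in zip(*bin0)]
    mean1 = [sum(col) / len(bin1) for col in zip(*bin1)]
    return [(m0 - m1) ** 2 for m0, m1 in zip(mean0, mean1)]


def dpa_decide(signals, partitions):
    """Hypothesis whose metric has the largest peak over all time indices.

    partitions: mapping from a hypothesis label to its selection bits.
    """
    peaks = {h: max(dpa_metric(signals, bits)) for h, bits in partitions.items()}
    return max(peaks, key=peaks.get)
```

For a correct hypothesis the two bins separate traces with genuinely different mean signals, so a peak appears at the time index where the predicted bit is processed; for a wrong hypothesis the bins are close to a random split and the metric stays near zero.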

The traditional DPA attack and its variations have been successfully applied 
to attack several cryptographic implementations. However, by using the theory 
developed in the previous section, the effectiveness of traditional DPA can be 
increased significantly. 

Before proceeding further, assume a void hypothesis $H_\phi$ which corresponds 
to a random bifurcation of the N signals into the 0-bin and the 1-bin. Using 
the Gaussian assumption and Equation 3, the metric of a hypothesis $H_i$ with 
respect to the null hypothesis at time j is given by 

$$M_{H_i}[j] = \frac{(\mu_{H_i}[j] - E[\mu_{H_\phi}[j]])^2}{V[\mu_{H_\phi}[j]]} - \frac{(\mu_{H_i}[j] - E[\mu_{H_i}[j]])^2}{V[\mu_{H_i}[j]]} + \ln \frac{V[\mu_{H_\phi}[j]]}{V[\mu_{H_i}[j]]} \qquad (8)$$

where $\mu_H[j] = \mu_{H,0}[j] - \mu_{H,1}[j]$ denotes the difference of sample means for hypothesis H. 
In order to compute this metric, we need the values of the following parameters: 
$E[\mu_{H_\phi}[j]]$, $E[\mu_{H_i}[j]]$, $V[\mu_{H_\phi}[j]]$, and $V[\mu_{H_i}[j]]$. Since in the DPA attack, the ad- 
versary skips the profiling phase of the attack, (8) is not directly applicable. In 
such cases, the theory suggests that unknown parameters of the test equation be 
estimated directly from the collected signals. If the adversary uses a maximum- 
likelihood estimate of these parameters, then the resulting test is referred to as 
generalized maximum-likelihood testing. 

For the DPA attack, calculating the maximum-likelihood estimate of the 
test parameters involves solving a set of nonlinear coupled equations. Therefore, 
instead of using the maximum-likelihood estimates of these parameters, we use 
sample estimates as follows: Let $\sigma^2_{H,0}[j]$ and $\sigma^2_{H,1}[j]$ be the sample variances of 
the signals in the 0-bin and the 1-bin respectively at time j for hypothesis H. 
We propose the following sample estimators of parameters in (8): 

$$E[\mu_H[j]] = \mu_H[j] = \mu_{H,0}[j] - \mu_{H,1}[j], \qquad V[\mu_H[j]] = \frac{\sigma^2_{H,0}[j]}{N_0} + \frac{\sigma^2_{H,1}[j]}{N_1} \qquad (9)$$

We omit the derivation of these estimators as the derivation is tedious and follows 
from straightforward algebraic manipulations. 

Fig. 3. DPA results, mean-difference vs. approx. generalized maximum-likelihood 



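Substituting these in (8), and taking the expected mean difference under the 
void hypothesis to be zero since the bifurcation is random, we get the following 
formula for the metric: 

$$M_H[j] = \frac{(\mu_{H,0}[j] - \mu_{H,1}[j])^2}{\frac{\sigma^2_{H_\phi,0}[j]}{N_0} + \frac{\sigma^2_{H_\phi,1}[j]}{N_1}} + \ln \frac{\frac{\sigma^2_{H_\phi,0}[j]}{N_0} + \frac{\sigma^2_{H_\phi,1}[j]}{N_1}}{\frac{\sigma^2_{H,0}[j]}{N_0} + \frac{\sigma^2_{H,1}[j]}{N_1}} \qquad (10)$$
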



Intuitively, the signals in the 0-bin and 1-bin have similar distributions un- 
der the wrong hypothesis due to a random bifurcation of signals in the two bins. 
However, for the correct hypothesis, the distribution of signals in the 0-bin differs 
from the distribution of signals in the 1-bin. The traditional DPA attack only 
takes into account the differences in sample means. On the other hand, Equa- 
tion 10 takes both the sample means and variances into account, and therefore 
may provide a better hypothesis test. 

Figure 3 shows the results of applying this method to attacking the S-box 
lookup for a DES implementation. The first column shows the bit being pre- 
dicted, the second shows the number of samples required for the correct key 
hypothesis to emerge as the winner under the traditional DPA metric while the 
third column shows the number of samples needed with the new metric. Clearly, 
the better metric reduces the number of signals needed by a factor of 1.4 to 3. 
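
To make the difference between the two kinds of metric concrete, the following is a minimal Python sketch. The function names are hypothetical, and the variance-aware statistic is an assumption: a simple squared mean difference scaled by the bin variances, chosen for illustration and not necessarily identical to Equation 10.

```python
from statistics import mean, variance


def split_bins(signals, bits):
    """Divide the signals into the 0-bin and the 1-bin for one hypothesis."""
    zero = [s for s, b in zip(signals, bits) if b == 0]
    one = [s for s, b in zip(signals, bits) if b == 1]
    return zero, one


def mean_difference_metric(signals, bits):
    """Traditional DPA: squared difference of bin sample means per time index."""
    zero, one = split_bins(signals, bits)
    return [(mean(c0) - mean(c1)) ** 2
            for c0, c1 in zip(zip(*zero), zip(*one))]


def variance_aware_metric(signals, bits):
    """Assumed illustration: squared mean difference scaled by bin variances."""
    zero, one = split_bins(signals, bits)
    n0, n1 = len(zero), len(one)
    result = []
    for c0, c1 in zip(zip(*zero), zip(*one)):
        spread = variance(c0) / n0 + variance(c1) / n1
        result.append((mean(c0) - mean(c1)) ** 2 / spread)
    return result


def best_hypothesis(signals, hypotheses, metric):
    """Choose the hypothesis whose metric has the largest peak over time."""
    return max(hypotheses,
               key=lambda name: max(metric(signals, hypotheses[name])))
```

Here `signals` is a list of equal-length sample vectors and `hypotheses` maps a hypothesis name to the predicted bit for each signal. Each bin needs at least two signals for the sample variance to be defined.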



4.2 Multi-channel DPA Attack 

The multi-channel DPA attack is a generalization of the single-channel DPA attack. 
In this case, the adversary collects N signals, $O_i$, $i = 1, \ldots, N$. In turn, each 
of the signals $O_i$ is a collection of L signals collected from L side-channels. 
Thus, $O_i = [O_i^1, \ldots, O_i^L]$, where $O_i^l$ represents the i-th signal from the l-th 
channel. Note that all DPA-style attacks treat each time instant independently 
and leakages from multiple channels can only be pooled together if they occur 
at the same time. Thus, in order for multi-channel DPA attacks to be effective, 
the selected channels must have very similar leakage characteristics. 

The formulae for computing the metric for the multi-channel DPA attack are 
generalizations of those for the single channel. The main difference is that the 
expected value of sample mean difference at time j under hypothesis H is a 
vector of length L, with the l-th entry being the sample mean difference of the 
l-th channel. Furthermore, the variance of the b-bin under hypothesis H at time 
j is a covariance matrix of size L x L with the i, j-th entry being the correlation 
between signals from the i-th and j-th channels. Once again, as in the DPA 
attack, the adversary does not have the luxury of estimating these parameters. 
Therefore, we substitute sample estimates for these parameters along the same 
lines as in Equation 9. We skip the cumbersome formulae and directly go to the 
results of multi-channel DPA attacks. 

Figure 4 shows sample results of an attack on the S-box lookups in a DES 
implementation using the power channel together with an EM channel whose 
leakage is similar to the power channel. The first column shows the bit being 
predicted, the second shows the number of signals required for the correct key 
hypothesis to emerge as the winner using both channels with the multi-channel 
metric, the last two columns show the number of signals needed for the power 
and EM channels separately using the new DPA/DEMA metric. From this it 
is clear that the number of invocations needed for two channel attacks can be 
significantly less compared to single-channel attacks. 

Fig. 4. Multi-Channel DPA-style attack using Power, EM, and Power & EM 



4.3 Future Work on Single/Multi-channel DPA/DEMA Attacks 

It is well known to DPA/DEMA practitioners that for the correct hypothe- 
sis, the correlation signal with respect to time shows multiple peaks. However, 
current analysis techniques, including the ones presented here, do not combine 
information from peaks occurring at different time instances. This problem also 
manifests itself when combining various Power and EM channels since peaks on 
different channels may not coincide. One can also view the efficacy gap between 
template attacks and DPA attacks as a manifestation of the same problem. 

We have started work which promises to bridge this gap. The main idea is 
to estimate the characteristics of useful peaks on the fly given only the collected 
signals (without using a training set) and apply techniques based on the maximum- 
likelihood principle to identify the correct hypothesis. 

Appendix: Leakage Assessment for Chipcards 

In this section, we address the question of whether one can assess and quan- 
tify the net leakage of information from multiple sensors. Can the information 
obtained by combining leakages from several (or even all possible) signals from 
available sensors be quantified regardless of the signal processing capabilities and 
computing power of an adversary? 

Maximum likelihood testing is the optimal way to perform hypothesis test- 
ing. Thus, we use it to craft a methodology to assess information leakage from 
elementary operations. Our methodology takes into account signals extractable 
from all the given sensors across the entire EM spectrum. Results of such an 
assessment will enable one to bound the success probability of the optimal ad- 
versary for any given hypothesis. 

Assume that for a single invocation, the adversary captures the emanations 
across the entire electromagnetic spectrum from all sensors in an observation 
vector O. Let $\Omega$ denote the space of all possible observation vectors O. Since the 
likelihood ratio $\Lambda(O)$ is a function of the random vector O, the best achievable 
success probability, $P_s$, is given by: 

$$P_s = \sum_{O \in \Omega} I_{\{\Lambda(O) > 1\}}\, p_{N_1}(O - S_1) + I_{\{\Lambda(O) < 1\}}\, p_{N_0}(O - S_0) \qquad (11)$$

where $I_A$ denotes the indicator function of the set A, and $p_{N_1}(O - S_1)$ and 
$p_{N_0}(O - S_0)$ are noise distributions under the hypotheses 1 and 0. 

When the adversary has access to multiple invocations, an easier way of 
estimating the probability of success/error involves a technique based on mo- 
ment generating functions. We begin by defining the logarithm of the moment 
generating function of the likelihood ratio: 

$$\mu(s) = \ln \left( \sum_{O \in \Omega} p_{N_1}^{s}(O - S_1)\, p_{N_0}^{1-s}(O - S_0) \right) \qquad (12)$$


The following is a well-known result from Information Theory: 

Fact 2 Assume we have several statistically independent observation vectors^ 
$O_1, O_2, \ldots, O_L$. For this case, the best possible exponent in the probability of 
error is given by the Chernoff Information: 

$$C \stackrel{\mathrm{def}}{=} -\min_{0 \le s \le 1} \mu(s) \qquad (13)$$



Note that $\mu(\cdot)$ is a smooth, infinitely differentiable, convex function and there- 
fore it is possible to approximate its minimizer $s_m$ by interpolating in the domain 
of interest and finding the minimum. Furthermore, under certain mild conditions 
on the parameters, the error probability can be approximated by: 

$$P_e \approx \frac{1}{\sqrt{8 \pi L \mu''(s_m)}\; s_m (1 - s_m)} \exp(L \mu(s_m)) \qquad (14)$$




Note that in order to evaluate (11) or (14), we need to estimate $p_{N_0}(\cdot)$ and 
$p_{N_1}(\cdot)$. In general, this can be a difficult task. However, by exploiting certain 
characteristics of CMOS devices, estimation of $p_{N_0}(\cdot)$ and $p_{N_1}(\cdot)$ can be 
made more tractable. 
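
As a small numerical illustration of the logarithm of the moment generating function and the Chernoff Information, the following Python sketch works on two discrete distributions given as lists of strictly positive probabilities over the same outcomes. The function names and the simple grid search are assumptions made for illustration; the text itself only suggests interpolating and finding the minimum.

```python
import math


def log_mgf(p1, p0, s):
    """mu(s) = ln of the sum over outcomes of p1^s * p0^(1 - s)."""
    return math.log(sum((a ** s) * (b ** (1.0 - s)) for a, b in zip(p1, p0)))


def chernoff_information(p1, p0, steps=1000):
    """Return (C, s_m), where C = -min mu(s) over a grid on [0, 1]."""
    best_s = 0.0
    best_mu = log_mgf(p1, p0, 0.0)
    for k in range(1, steps + 1):
        s = k / steps
        value = log_mgf(p1, p0, s)
        if value < best_mu:
            best_s, best_mu = s, value
    return -best_mu, best_s
```

Because mu is convex, a finer grid or a one-dimensional convex minimizer could replace the grid search without changing the idea.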



Practical Considerations 

We will now outline some of the practical issues associated with estimating $p_{N_0}(\cdot)$ 
and $p_{N_1}(\cdot)$ for any hypothesis. The key here is to estimate the noise distribution 
for each cycle of each elementary operation and for each relevant state R that 
the operation can be invoked with. This results in the signal characterization, 
$S_R$, and the noise distribution, $p_{N_R}(\cdot)$, which are sufficient for evaluating $p_{N_0}(\cdot)$ 
and $p_{N_1}(\cdot)$. 

There are two crucial assumptions that facilitate estimating $p_{N_R}(\cdot)$. First, on 
the chipcards we examined, the typical clock cycle is 270 nanoseconds. For such 
devices, most of the compromising emanations are well below 1 GHz and can 
be captured by sampling the signals at a Nyquist rate of 2 GHz. This sampling 
rate results in a vector of 540 points per cycle per sensor. Alternatively, one can 
also capture all compromising emanations by sampling judiciously chosen and 
slightly overlapping bands of the EM spectrum. The choice of selected bands is 
dictated by considerations such as signal strength and limitations of the available 
equipment. Note that the slight overlapping of EM bands would result in a 
corresponding increase in the number of samples per clock cycle; however, it 
remains in the range of 600-800 samples per sensor. 

^ For simplicity, this paper deals with independent elementary operation invocations. 
Techniques also exist for adaptive invocations. 

The second assumption, borne out in practice (see [4]), is that for a fixed 
relevant state, the noise distribution $p_{N_R}(\cdot)$ can be approximated by a Gaussian 
distribution. This fact greatly simplifies the estimation of $p_{N_R}(\cdot)$ as only about 
one thousand samples are needed to roughly characterize $p_{N_R}(\cdot)$. Moreover, the 
noise density can be stored compactly in terms of the parameters of the Gaussian 
distribution. 

These two assumptions imply that in order to estimate $p_{N_R}(\cdot)$ for a fixed 
relevant state R, we need to repeatedly invoke (say 1000 times) an operation 
on the device starting in the state R, and collect samples of the emanations as 
described above. Subsequently, the signal characterization $S_R$ can be obtained by 
averaging the collected samples. The noise characterization is obtained by first 
subtracting $S_R$ from each of the samples and then using the Gaussian assumption 
to estimate the parameters of the noise distribution. 
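
The averaging and subtraction steps described above can be sketched in a few lines of Python. This is a minimal illustration under the Gaussian assumption; the function name `characterize` and the choice of the unbiased sample covariance as the noise parameter are assumptions, not details given in the text.

```python
def characterize(samples):
    """Estimate the signal and noise covariance for one relevant state.

    samples: list of equal-length observation vectors, one per invocation.
    Returns (signal, covariance): the mean vector, and the sample covariance
    matrix of the residual noise, which is zero-mean by construction.
    """
    count = len(samples)
    dim = len(samples[0])
    # Signal characterization: average of the collected samples.
    signal = [sum(v[t] for v in samples) / count for t in range(dim)]
    # Noise: subtract the signal from each sample.
    noise = [[v[t] - signal[t] for t in range(dim)] for v in samples]
    # Gaussian noise parameters: unbiased sample covariance matrix.
    covariance = [[sum(r[a] * r[b] for r in noise) / (count - 1)
                   for b in range(dim)] for a in range(dim)]
    return signal, covariance
```

For vectors of several hundred points per sensor a numerical library would be the practical choice; the nested lists are used here only to keep the sketch self-contained.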

The assessment can now be used to bound the success of any hypothesis 
testing attack in our adversarial model. For any two given distributions $B_0$ and 
$B_1$ on the relevant states, the corresponding signal and noise characterizations, 
$S_0$, $S_1$, $p_{N_0}(\cdot)$, and $p_{N_1}(\cdot)$, are a weighted sum of the signal and noise assess- 
ments of the constituent relevant states, $S_R$ and $p_{N_R}(\cdot)$. The error probability of 
maximum-likelihood testing for a single invocation or its exponent for L invoca- 
tions can then be bounded using (11) and (13) respectively. 

We now give a rough estimate of the effort required to obtain the leakage 
assessment of an elementary operation. The biggest constraint in this process is 
the time required to collect samples from approximately one thousand invoca- 
tions for each relevant state of the elementary operation. For an r-bit machine, 
the relevant states of interest are approximately $2^r$; thus the leakage assessment 
requires time to perform approximately $1000 \cdot 2^r$ invocations. Assuming that 
the noise is Gaussian and that each sensor produces an observation vector of 
length 800, for n sensors the covariance matrix $\Sigma_N$ has $(800 \cdot n)^2$ entries. It 
follows that the computation burden of estimating the noise distribution would 
be proportional to $(800 \cdot n)^2$. Such an approach is certainly feasible for an eval- 
uation agency, from both a physical and computational viewpoint, as long as 
the size of the relevant state, r, is small. In our experiments, we found such 
assessment possible for a variety of 8-bit chipcards. 

Abstract. We present HMM attacks, a new type of cryptanalysis based on mod- 
eling randomized side channel countermeasures as Hidden Markov Models 
(HMM’s). We also introduce Input Driven Hidden Markov Models (IDHMM’s), a 
generalization of HMM’s that provides a powerful and unified cryptanalytic frame- 
work for analyzing countermeasures whose operational behavior can be modeled 
by a probabilistic finite state machine. IDHMM’s generalize previous cryptanaly- 
ses of randomized side channel countermeasures, and they also often yield better 
results. We present efficient algorithms for key recovery using IDHMM’s. Our 
methods can take advantage of multiple traces of the side channel and are in- 
herently robust to noisy measurements. Lastly, we apply IDHMM’s to analyze 
two randomized exponentiation algorithms proposed by Oswald and Aigner. We 
completely recover the secret key using as few as ten traces of the side channel. 



1 Introduction 

Randomized countermeasures [1, 2, 3, 4, 5, 6, 7, 8] for side channel attacks [8, 9, 10] are a 
promising, inexpensive alternative to hardware based countermeasures. In order to gain 
strong assurance in randomized schemes, we need some way to analyze their security 
properties, and ideally, we would like general-purpose techniques. To this end, we present 
HMM attacks, a new type of cryptanalysis based on modeling countermeasures as Hidden 
Markov Models (HMM’s) [11]. We also introduce Input Driven Hidden Markov Models 
(IDHMM’s), a generalization of HMM’s. IDHMM’s are particularly well suited for 
analyzing any randomized countermeasure whose internal operation can be modeled by 
a probabilistic finite state machine. 

Hidden Markov Models (HMM’s) [11] are a well-studied model for finite-state 
stochastic processes. An execution of an HMM consists of a sequence of hidden, unob- 
served states and a corresponding sequence of related, observable outputs. HMM’s are 
memoryless: given the current state, the conditional probability distribution for the next 
state is independent of all previous states. The main problem of interest in HMM’s is 
the inference problem, to infer the values of the hidden, unobserved states given only 
the sequence of observable outputs. The Viterbi algorithm [12] is an efficient dynamic 
programming algorithm for solving this problem. 

At first glance, HMM’s seem perfect for analyzing randomized countermeasures 
which can be modeled by a probabilistic finite state machine: the hidden states of the 
HMM represent the internal states of the countermeasures and the observable outputs 
represent observations of the side channel. However, HMM’s have deficiencies which 
prevent them from being directly applicable. Firstly, HMM’s do not model inputs. 
HMM’s model processes as a sequence of states. However, the internal operation of 
a randomized countermeasure is likely to be dependent on both the current internal state 
as well as an input: the secret key. IDHMM’s extend HMM’s to handle inputs so we 
can accurately model randomized keyed countermeasures. Secondly, standard inference 
techniques like Viterbi’s algorithm cannot leverage multiple output traces of an HMM. 
However, the ability to handle multiple output traces will make our key recovery attacks 
more powerful. To address this, we present an efficient approximate inference algorithm 
for IDHMM’s that handles multiple output traces. 

Table 1. Summary of attacks on OA1 and OA2, two randomized side channel countermeasures 
proposed by Oswald and Aigner. Note that our new attacks are the first to work even with a noisy 
side channel. 

Attack               Relevant         Observation   Number of traces needed     Workfactor 
                     countermeasure   error (p_e)   to recover the secret key 
Okeya-Sakurai [13]   OA1              0             292                         minimal 
C.D. Walter [14]     OA1, OA2         0             2-10                        minimal 
HMM attacks (new)    OA1, OA2         0             10                          minimal 
HMM attacks (new)    OA1, OA2         0             5 
HMM attacks (new)    OA1, OA2         0.1           10 
HMM attacks (new)    OA1, OA2         0.25          50-500 

To demonstrate how HMM attacks can be used in practice, we show how to break two 
randomized exponentiation algorithms proposed by Oswald and Aigner [2]. Previously 
known attacks [13, 14] against these algorithms assume the ability to perfectly distinguish 
between elliptic curve point additions and doublings in the side channel. We present more 
powerful attacks which are robust to noise. A summary of our attacks in comparison to 
previous work is shown in Table 1. 



2 Modeling Randomized Side Channel Countermeasures as 
Probabilistic Finite State Machines 

Many authors have proposed randomization as a way to limit the security risks from 
information leaked over side channels [1,2,3,4,5,6,7,8,15]. However, the security af- 
forded by randomization in this setting is not clear. Side channel attacks are typically 
successful because of the high correlation between the information leaked over the side 
channel and the internal state of the device, most notably a secret key used in various 
cryptographic operations. The hope behind randomized countermeasures is that the side 
channel information will become randomized as well, thus making it harder to analyze. 
An ideal randomized countermeasure would completely disassociate the side channel 
information from the internal state of the device, or more formally, for any set of mea- 
surements of the side channel, the likelihood of an adversary guessing any information 
about the internal state of the device would be the same as if the adversary had observed 
no side channel information at all. Some examples of randomized countermeasures in- 
clude randomized exponentiation algorithms [1, 2, 3, 4], random window methods [5, 6], 
randomized instruction execution [7], randomized timing shifts [8], randomized blinding 
of the secret key [15], and randomized projective coordinates [15]. 

    Input: k, M    Output: k x M 

    Q = M 
    P = 0 
    for i = 1 to N 
        if (k_i == 1) then P = P + Q 
        Q = 2Q 
    return P 

(a) The Binary Algorithm for ECC scalar multiplication. 

    Input: k, M    Output: k x M 

    Q = M 
    P = 0 
    for i = 1 to N 
        b = rand_bit() 
        if (k_i == 0) then 
            if (b == 1) then 
                R = P + Q    // result is discarded 
        else 
            P = P + Q 
        Q = 2Q 
    return P 

(b) A randomized variant of the Binary Algorithm for ECC multiplication. 

Fig. 1. Introducing randomness into the Binary Algorithm for ECC scalar multiplication. 

We introduce a new cryptanalytic technique based on Hidden Markov Models to 
analyze such randomized countermeasures. To help give the intuition behind our analysis, 
we first give a simple example of a fabricated countermeasure that uses randomization, 
show how to model its operation using a probabilistic finite state machine, and then 
motivate the use of Hidden Markov Models to analyze its security. 

2.1 A Simple Randomized Countermeasure 

Consider the randomized variant of the standard binary algorithm for doing scalar mul- 
tiplications over elliptic curves shown in Figure 1(b). Assume $k = k_N k_{N-1} \ldots k_2 k_1$ is 
the N-bit secret key and M and P are points on the elliptic curve. 

The major difference between the algorithm in Figure 1(b) and the standard Binary 
Algorithm is as follows: in each iteration, if the next key bit is 0, then with probability 
1/2 our algorithm will execute a discarded spurious addition, but if the next key bit is 
1, it behaves the same as the standard Binary Algorithm. This randomized variant of 
the Binary Algorithm is completely artificial and by no means secure. It was created 
solely to demonstrate how randomness might be used in the construction of side channel 
countermeasures and will serve as a running example to illustrate our techniques. 

Now, assume that it is possible for an adversary observing the side channel to distin- 
guish between elliptic curve point additions and elliptic curve point doublings in a single 
scalar multiplication. Then the adversary’s observation of a single scalar multiplication 
can be represented as a sequence $(y_1, y_2, \ldots, y_N)$, $y_i \in \{D, AD\}$, where D represents 
an elliptic curve point doubling and A represents an elliptic curve point addition. Each 
$y_i$ represents the operations observed during the processing of a single bit of the key. 
We refer to such a sequence as a trace. 
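
To see how such traces arise, the following Python sketch simulates the randomized variant of the Binary Algorithm and records one observation per key bit. As a loudly stated simplification, elliptic curve points are replaced by plain integers, so that point addition is integer addition and doubling is multiplication by 2. The function name and the `rng` parameter are hypothetical.

```python
import random


def randomized_binary_multiply(key_bits, m, rng):
    """Simulate the randomized Binary Algorithm on integers.

    key_bits: [k_1, ..., k_N], least significant bit first.
    m: integer standing in for the point M (an assumption of this sketch).
    Returns (k * m, trace), where trace[i] is "D" or "AD".
    """
    q, p = m, 0
    trace = []
    for bit in key_bits:
        ops = ""
        if bit == 0:
            if rng.getrandbits(1) == 1:
                _ = p + q  # spurious addition, result is discarded
                ops += "A"
        else:
            p = p + q
            ops += "A"
        q = 2 * q  # doubling happens in every iteration
        ops += "D"
        trace.append(ops)
    return p, trace
```

A call such as `randomized_binary_multiply([1, 0, 1], 7, random.Random())` returns the product together with a trace whose middle entry is `"D"` or `"AD"` depending on the random bit.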

Note there is no longer a one-to-one correspondence between each possible trace and 
each possible key. Rather, each given sequence of observable operations is consistent 
with several possible keys. For example, if the trace from a scalar multiplication using 
the algorithm in Figure 1(b) is (AD, AD, D), then there are four possible keys consistent 
with this trace: namely, 011, 001, 010, or 000. 

Fig. 2. A probabilistic finite state machine that models the operation of the randomized exponen- 
tiation algorithm in Figure 1(b). 
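
The four keys in the example can be checked by brute force. The short Python sketch below enumerates every key and keeps those that could have produced a given trace under the randomized algorithm of Figure 1(b): a 1 bit always yields AD, while a 0 bit yields either D or AD. The function name is hypothetical.

```python
from itertools import product


def consistent_keys(trace):
    """Return all keys, written k_N ... k_1, that could produce the trace.

    trace: sequence (y_1, ..., y_N) with entries "D" or "AD".
    """
    keys = []
    for bits in product((0, 1), repeat=len(trace)):  # bits[i] is k_(i+1)
        # An observed "D" is only possible when the key bit is 0.
        if all(y == "AD" or b == 0 for b, y in zip(bits, trace)):
            keys.append("".join(str(b) for b in reversed(bits)))
    return sorted(keys)
```

The search visits all 2^N keys, so it is only meant for small examples like this one.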

2.2 Probabilistic Finite State Machines 

Although there are clearly many ad-hoc ways to break the algorithm in Figure 1(b), its 
primary purpose is to illustrate the development of a general technique for analyzing 
randomized countermeasures. Several weaknesses have been discovered in some exist- 
ing randomized countermeasures [13,14,16], but the analysis techniques used are often 
specific to the particular countermeasure, and it is not obvious how to generalize them 
to a framework applicable to a larger class of algorithms. The primary benefit of a gen- 
eral analytical framework is that it enables the analysis of a large class of randomized 
countermeasures while minimizing the overhead needed to analyze any one particular 
algorithm. Although such a framework by itself may not in general be able to prove 
the security of every conceivable countermeasure, it can help quickly determine if a 
countermeasure is insecure, potentially saving a cryptanalyst many hours of work. 

A key component for a general analytical framework is a good operational model of 
the countermeasures. A simple model applicable to many randomized countermeasures 
is a probabilistic finite state machine. The resulting finite state model for our running 
example can be easily constructed from its algorithmic description and is shown in Figure 
2. Each state corresponds to a full iteration of the loop in Figure 1(b) (i.e., the processing 
of one key bit in its entirety) and is labeled with the operations (D or AD) that may 
be observed when that state is visited. Each edge is annotated with a bit from the key 
and a probability. In general, one of the states would be designated as initial, but in this 
example any state can serve as the initial state. The model of execution is simple: given 
the current state $q_i$ and the next bit of the key $k_{i+1}$, the next state $q_{i+1}$ is determined 
probabilistically according to the probabilities on those outgoing edges of $q_i$ that are 
annotated with $k_{i+1}$. 

While the edges capture the control structure of the algorithm, the states abstract away 
the details of the calculation. However, the label on each state indicates the observable 
information that leaks through the side channel when the process enters the state. In 
this example, the observations are what type of operations (elliptic curve point additions 
and/or doublings) are executed while in a particular state. Note however, since the state 
machine is randomized, some particular traces could arise from several different paths 
through the machine. For example, the trace (AD, AD, D) could arise from any of the 
following paths through the state machine in Figure 2: $s_2 s_2 s_0$, $s_2 s_1 s_0$, $s_1 s_2 s_0$, or $s_1 s_1 s_0$. 
More formally, we define a probabilistic finite state machine to be a sextuple 

$$M = (S, I, \delta, O, s_0, \mu)$$

where 

- S is a finite set of internal states, 
- I is a finite set of input symbols, 
- $\delta : S \times S \times I \to [0, 1]$ is a function called the transition function, 
- O is a finite set of symbols that represent operations observable over the side channel, 
- $s_0 \in S$ is the initial state, 
- $\mu : S \to O$ is a function associating an observable operation with every state, 

and the following condition is satisfied: 

$$\forall s_i \in S,\ \forall b \in I,\quad \sum_{s_j \in S} \delta(s_i, s_j, b) = 1.$$



In our setting, the set of input symbols is $I = \{0, 1\}$, representing the bits of a secret 
key. 

For a key $k = k_N k_{N-1} \ldots k_2 k_1$, define an execution q of $M = (S, I, \delta, O, s_0, \mu)$ 
on k to be a sequence $q = (q_0, q_1, \ldots, q_{N-1}, q_N)$, where $q_0 = s_0$ and $q_i \in S$, such 
that for $0 \le i < N$, $\delta(q_i, q_{i+1}, k_{i+1}) > 0$. Define a trace y of an execution q to be a 
sequence $y = (y_1, y_2, \ldots, y_N)$, where $y_i = \mu(q_i)$, $\forall i > 0$. In the case of our example, 
an execution corresponds to the sequence of internal states traversed during a scalar 
multiplication of the secret key k, and the trace of that execution represents the sequence 
of observable elliptic curve point additions and doublings. 
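
The definitions above translate directly into a small data structure. The Python sketch below encodes a three-state machine for the running example and samples an execution and its trace. The concrete transition table is an assumption, since the figure itself is not reproduced in the text: a 1 bit leads to `s1` (AD) with probability 1, and a 0 bit leads to `s0` (D) or `s2` (AD) with probability 1/2 each. All names are hypothetical.

```python
import random

STATES = ("s0", "s1", "s2")
INPUTS = (0, 1)
MU = {"s0": "D", "s1": "AD", "s2": "AD"}  # observable operation per state

# Assumed transition table: DELTA[(s_i, s_j, b)] is the transition probability.
DELTA = {}
for _si in STATES:
    DELTA[(_si, "s1", 1)] = 1.0
    DELTA[(_si, "s0", 0)] = 0.5
    DELTA[(_si, "s2", 0)] = 0.5


def delta(si, sj, b):
    """Transition function; missing entries have probability 0."""
    return DELTA.get((si, sj, b), 0.0)


def is_well_formed():
    """Check that outgoing probabilities sum to 1 for every state and input."""
    return all(abs(sum(delta(si, sj, b) for sj in STATES) - 1.0) < 1e-12
               for si in STATES for b in INPUTS)


def execute(key_bits, rng, start="s0"):
    """Sample an execution (q_0, ..., q_N) and its trace (y_1, ..., y_N)."""
    q = [start]
    for b in key_bits:  # key_bits = [k_1, ..., k_N]
        weights = [delta(q[-1], sj, b) for sj in STATES]
        q.append(rng.choices(STATES, weights=weights)[0])
    trace = [MU[s] for s in q[1:]]
    return q, trace
```

Since any state can serve as the initial state in this example, the default `start="s0"` is an arbitrary choice.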

2.3 The Key Recovery Problem for Probabilistic Finite State Machines 

Since one of the primary goals of side channel attacks is to recover the secret key stored 
within a target device, we wish to solve the following problem: 



Key Recovery Problem for Probabilistic Finite State Machines 

Let M be a probabilistic finite state machine. Generate a random N-bit key k and 
an execution q of M on k. Let y be the trace of q. The Key Recovery Problem for 
probabilistic finite state machines is to find k given M and y. 

One approach to solving the Key Recovery Problem for probabilistic finite state 
machines is the following: 1) Given a trace y and machine M, try to infer the execution 
q it resulted from, and then 2) infer k from q. Step 2 becomes easy if we restrict M to 
be faithful. A probabilistic finite state machine $M = (S, I, \delta, O, s_0, \mu)$ is said to be 
faithful if it satisfies the following property: 

$$\forall s_i, s_j \in S,\ \text{if } \delta(s_i, s_j, 0) > 0, \text{ then } \delta(s_i, s_j, 1) = 0.$$

For faithful machines, there is a one-to-one correspondence between an execution q and 
the key k used in that execution. This is because for every pair of consecutive states 
$s_i$, $s_j$ in an execution, there is no ambiguity in what bit annotated the corresponding 
directed edge that was taken from $s_i$ to $s_j$. Note that this condition does not limit 
the expressiveness of our framework by restricting M in any significant way. If for a 
machine M there exist $s_i$, $s_j$ such that $\delta_M(s_i, s_j, 0) > 0$ and $\delta_M(s_i, s_j, 1) > 0$, then an 
observationally equivalent machine M' can be constructed that is identical to M except 
state $s_j$ is replaced by two states $s_{j_1}$, $s_{j_2}$ such that $\delta_{M'}(s_i, s_{j_1}, 0) = \delta_M(s_i, s_j, 0)$, 
$\delta_{M'}(s_i, s_{j_1}, 1) = 0$, $\delta_{M'}(s_i, s_{j_2}, 0) = 0$, and $\delta_{M'}(s_i, s_{j_2}, 1) = \delta_M(s_i, s_j, 1)$. Thus, 
without loss of generality, we will only consider probabilistic finite state machines that 
are faithful. 
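
The faithfulness property and the resulting key read-off are easy to express in code. In the Python sketch below, a transition function is represented as a dictionary from `(s_i, s_j, bit)` to a probability; this representation and the function names are assumptions made for illustration.

```python
def is_faithful(states, delta):
    """True if no pair of states is connected by both a 0-edge and a 1-edge.

    delta maps (s_i, s_j, bit) to a probability; missing entries mean 0.
    """
    for si in states:
        for sj in states:
            if (delta.get((si, sj, 0), 0.0) > 0
                    and delta.get((si, sj, 1), 0.0) > 0):
                return False
    return True


def recover_key(execution, delta):
    """Read off the key bits [k_1, ..., k_N] from a faithful execution."""
    bits = []
    for si, sj in zip(execution, execution[1:]):
        labels = [b for b in (0, 1) if delta.get((si, sj, b), 0.0) > 0]
        if len(labels) != 1:
            raise ValueError("edge is missing or ambiguous")
        bits.append(labels[0])
    return bits
```

For a faithful machine every edge carries exactly one bit, which is why `recover_key` can treat any other case as an error.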

2.4 The State Inference Problem for Probabilistic Finite State Machines 

We define the State Inference Problem for probabilistic finite state machines as follows: 



State Inference Problem for Probabilistic Finite State Machines 

Let M be a probabilistic finite state machine. Generate a random N-bit key k and 
an execution q of M on k. Let y be the trace of q. The State Inference Problem for 
probabilistic finite state machines is to find q given M and y. 



Because of the one-to-one correspondence between q and the key k used in that 
execution, solving the State Inference Problem for M and y is equivalent to solving the 
Key Recovery Problem for M and y.* 

One way an adversary might try to solve the State Inference Problem for $M = 
(S, I, \delta, O, s_0, \mu)$ and y of length N is to treat the unknown execution q as a random 
variable Q with sample space $S^{N+1}$ and use maximum likelihood decoding. A simple 
implementation of maximum likelihood decoding involves two steps: 

Input: trace y, machine M 

1. Calculate $\Pr[Q = s \mid y]$, for each $s \in S^{N+1}$. 

2. Output $q = \arg\max_s \Pr[Q = s \mid y]$. 

The adversary’s output is the most likely execution q for the given trace y, which then 
yields the most likely key k. 

* Although we have formulated both problems in a way that implies deterministic solutions, 
randomized algorithms with significant success probability are acceptable as well. 

A naive implementation of step 1 will have a running time exponential in the length of 
the trace. However, we will see how to transform a probabilistic finite state machine into 
a Hidden Markov Model, in which there is an equivalent State Inference Problem with a 
polynomial running time solution. In addition to having efficient inference algorithms, 
we will see that Hidden Markov Models have other advantages as well. 
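
For intuition, the naive two-step decoder can be written out directly. The Python sketch below enumerates every candidate execution of a small faithful machine, scores it by the joint probability of the execution and the trace under a uniformly random key (which is proportional to the posterior), and returns the best one. The three-state machine is the same assumed model of the running example used for illustration only, and the names are hypothetical. The loop visits |S|^N candidates, which is the exponential cost mentioned above.

```python
from itertools import product

STATES = ("s0", "s1", "s2")
MU = {"s0": "D", "s1": "AD", "s2": "AD"}


def delta(si, sj, b):
    """Assumed transitions: bit 1 -> s1; bit 0 -> s0 or s2, each with 1/2."""
    if b == 1:
        return 1.0 if sj == "s1" else 0.0
    return 0.5 if sj in ("s0", "s2") else 0.0


def edge_bit(si, sj):
    """The unique key bit labelling the edge (machine is faithful), or None."""
    for b in (0, 1):
        if delta(si, sj, b) > 0:
            return b
    return None


def ml_decode(trace, start="s0"):
    """Brute-force maximum likelihood decoding of a single trace."""
    best, best_p = None, 0.0
    for tail in product(STATES, repeat=len(trace)):
        q = (start,) + tail
        p = 1.0
        for i, y in enumerate(trace):
            b = edge_bit(q[i], q[i + 1])
            if b is None or MU[q[i + 1]] != y:
                p = 0.0
                break
            p *= 0.5 * delta(q[i], q[i + 1], b)  # uniform bit times transition
        if p > best_p:
            best, best_p = q, p
    if best is None:
        return None, None
    return best, [edge_bit(a, b) for a, b in zip(best, best[1:])]
```

For the trace (AD, AD, D) this selects the execution through `s1`, `s1`, `s0`, corresponding to key bits k_1 = 1, k_2 = 1, k_3 = 0.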

In this section, we introduced probabilistic finite state machines, an intuitive tech- 
nique for modeling the operation of randomized countermeasures, and we demonstrated 
the use of the model with an artificial yet instructive example. In the remainder of this 
paper, we will show how Hidden Markov Models not only provide a sound, well-studied 
framework, but also how they can be extended into even more powerful and flexible 
cryptanalytical tools for analyzing randomized countermeasures. 

3 Assumptions 

Before formally describing our analytical framework for randomized countermeasures 
using Hidden Markov Models, we will make our assumptions more precise. Our analysis 
depends on the following assumptions: 

- We have collected a set of L traces from the side channel, corresponding to L 
executions of the countermeasure, all using the same secret key. 

- Each trace of the side channel can be uniquely written as $(y_1, y_2, \ldots, y_N)$ where 
each $y_i$ is an element of some finite observation set O. In the example presented in 
Section 2, $O = \{D, AD\}$. 

- The operations in O can be probabilistically distinguished from each other. 

- Each observation yi from O can be associated with the processing of a single key 
bit position, and vice versa. 

If the attacker is lucky, the side channel reveals exactly which action from O has 
been taken, and thus the observation traces $(y_1, y_2, \ldots, y_N)$ are free of errors. This is 
the model some previous work has used, and it does simplify analysis. However, this 
assumption may not always be realistic. 

In the more general case, observations may only yield partial information on the 
actual trace, hence our measurements may contain errors. As we will see in Section 4, 
our techniques are still applicable in this setting. When there are only two different types 
of observations, a simple model of this behavior is that each observation has probability 
$1 - p_e$ of being correct and probability $p_e$ of being mischaracterized. This setting may 
be more realistic in practice, particularly for devices that try to make all operations look 
alike. 
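
The two-observation error model is simple to simulate. The Python sketch below takes an error-free trace over {D, AD} and mischaracterizes each observation independently with probability p_e. The function name and the `rng` parameter are hypothetical.

```python
import random

FLIP = {"D": "AD", "AD": "D"}


def observe(trace, p_e, rng):
    """Return a noisy copy of the trace: each entry is wrong with probability p_e."""
    return [FLIP[y] if rng.random() < p_e else y for y in trace]
```

With `p_e = 0` the trace is returned unchanged, which corresponds to the error-free model used in some previous work.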

4 Input Driven Hidden Markov Models as a Model for 
Randomized Side Channel Countermeasures 

In Section 2, we outlined an approach for analyzing randomized countermeasures which 
infers the most likely secret key from the sequence of observable operations. However, 
this approach is not only intractable, but has other deficiencies as well. Four main chal- 
lenges remain: 
Fig. 3. An execution of a Hidden Markov Model, represented in a probabilistic graphical model. 
This figure depicts one execution of the HMM. Each node represents a random variable, and 
the directed edges indicate conditional dependencies. A shaded node indicates the corresponding 
variable is observed (i.e. outputs we can observe), while unshaded nodes are unobserved (i.e. what 
we wish to recover). 



1 . Efficient inference algorithms are needed. A naive implementation of maximum 
likelihood decoding for a single trace has running time exponential in the length of 
the trace. In order to be useful, inference algorithms must scale better. 

2. Side channel measurements may be noisy. As we mentioned in Section 3, our 
measurements of the side channel may be noisy and contain errors. It is desirable 
to have techniques that tolerate noise. 

3. We need a model that handles inputs. Hidden Markov Models will serve as a 
starting point for our techniques, but HMM’s only have outputs and do not model 
inputs. In order to accurately model the secret key, we need a framework that models 
processes with both inputs and outputs. 

4. One trace is typically not enough. For any reasonable countermeasure, the set 
of possible keys consistent with a single trace will be large. Hence, attacks that 
examine only a single trace are unlikely to be successful. However, by gathering 
multiple traces that result from use of the same key, we may be able to narrow the 
list of likely candidates. Thus, it is desirable to have techniques that can analyze an 
arbitrary number of traces. This will make our analysis both more general and more 
powerful. 

First, we will show how Hidden Markov Models can be used to solve problems 1 and 
2. Then, we introduce an extension to HMM’s, Input Driven Hidden Markov Models, 
that address problems 3 and 4. 



4.1 Hidden Markov Models 

Hidden Markov Models (HMM’s) [11] are a well-studied method for modeling finite- 
state stochastic processes. The word “hidden” indicates that the states of the process are 
not directly observable. Instead, related to each state is an output which is observable. One 
of the main problems of interest for Hidden Markov Models is the inference problem: to 
infer the most likely sequence of values for the hidden states given only the observations. 
Since there exist efficient algorithms for solving the inference problem in HMM’s, this 
motivates trying to model randomized countermeasures as HMM’s in a way so that the 
key recovery problem for a randomized countermeasure becomes the inference problem 
in an HMM. 

HMM’s induce two sequences of finite random variables: the hidden states, 
$Q_1, Q_2, \ldots, Q_N$, and the observations¹, $Y_1, Y_2, \ldots, Y_N$. Like regular Markov Models, the 
value of the next state is dependent only on the current state and not any previous states. 
That is, the distribution of $Q_n$ is conditionally independent of $Q_1, Q_2, \ldots, Q_{n-2}$ given 
$Q_{n-1}$. In addition, it is assumed that the distribution of $Y_n$ is conditionally independent 
of everything else given $Q_n$. 

HMM’s are parameterized by the local conditional distributions $\Pr[Q_n \mid Q_{n-1}]$ and 
$\Pr[Y_n \mid Q_n]$, both of which are assumed to be independent of $n$. If $S = \{s_1, s_2, \ldots, s_M\}$, 
the conditional distribution $\Pr[Q_n \mid Q_{n-1}]$ is parameterized by an $M \times M$ transition 
matrix $A$, where $A_{ij} = \Pr[Q_n = s_j \mid Q_{n-1} = s_i]$. Since in our setting the sample 
space of the observations $Y_n$ is a finite observation set $O = \{o_1, o_2, \ldots, o_J\}$, the 
conditional distribution $\Pr[Y_n \mid Q_n]$ is parameterized by an $M \times J$ output matrix $B$, where 
$B_{ij} = \Pr[Y_n = o_j \mid Q_n = s_i]$. 

In summary, for our setting, a Hidden Markov Model is defined by the quintuple 

$H = (S, O, A, B, s_0)$ 

where 

$S$ is a finite set of internal states, 

$O$ is a finite set of symbols that represent operations observable over the side channel, 

$A$ is a $|S| \times |S|$ matrix where $A_{ij} = \Pr[Q_n = s_j \mid Q_{n-1} = s_i]$, 

$B$ is a $|S| \times |O|$ matrix where $B_{ij} = \Pr[Y_n = o_j \mid Q_n = s_i]$, 

$s_0 \in S$ is the initial state. 

We refer to a realization $q_1, q_2, \ldots, q_N$ of the random variables $Q_1, Q_2, \ldots, Q_N$ 
as an execution of the HMM, and to a realization $y_1, y_2, \ldots, y_N$ of the random 
variables $Y_1, Y_2, \ldots, Y_N$ as a trace of the HMM. Recall that traces are observable whereas 
executions are generally not. 
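
To make the distinction between an execution and a trace concrete, the following Python sketch samples both from a small HMM. The three-state model, its matrices, and the function name `sample_hmm` are hypothetical and chosen only for illustration; the assumption that the first state $Q_1$ is drawn from the row of $A$ belonging to the initial state is also ours.

```python
import random

def sample_hmm(states, outputs, A, B, s0, N, rng):
    """Sample an execution (hidden states) and a trace (observations).

    A[i][j] = Pr[Q_n = states[j] | Q_{n-1} = states[i]]
    B[i][j] = Pr[Y_n = outputs[j] | Q_n = states[i]]
    The first state is drawn from the row of A belonging to s0.
    """
    execution, trace = [], []
    current = states.index(s0)
    for _ in range(N):
        current = rng.choices(range(len(states)), weights=A[current])[0]
        y = rng.choices(range(len(outputs)), weights=B[current])[0]
        execution.append(states[current])
        trace.append(outputs[y])
    return execution, trace

# Hypothetical toy model: state s2 emits AD, the other states emit D.
STATES = ["s0", "s1", "s2"]
OUTPUTS = ["D", "AD"]
A = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0]]
B_EXACT = [[1.0, 0.0],
           [1.0, 0.0],
           [0.0, 1.0]]

execution, trace = sample_hmm(STATES, OUTPUTS, A, B_EXACT, "s0", 10,
                              random.Random(3))
print("execution (hidden):  ", execution)
print("trace     (observed):", trace)
```

An attacker sees only the second printed line; the inference problem discussed next is to recover the first.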

Hidden Markov Models have a graphical representation as the probabilistic graphical 
model [11] shown in Figure 3. Probabilistic graphical models are a graphical represen- 
tation of a set of random variables where each node represents a random variable and 
the directed edges indicate conditional dependencies. A shaded node indicates the cor- 
responding variable is observed, while unshaded nodes are unobserved. In the case of 
HMM’s, the shaded nodes correspond to the observations $Y_n$, and the unshaded nodes 
correspond to the hidden, unobserved states $Q_n$. 

Consider a probabilistic finite state machine with no inputs, i.e., with no keys. That 
is, for $M = (S, \delta, O, s_0, \mu)$, the transition function is given by $\delta : S \times S \to [0, 1]$ rather 
than $\delta : S \times S \times I \to [0, 1]$. If we wish to infer the most likely execution for $M$ given 
a trace $y$ of length $N$, we can construct a Hidden Markov Model $H = (S, O, A, B, s_0)$ 
where the unknown states in the execution can be modeled by the random variables 
$Q_1, Q_2, \ldots, Q_N$ and the corresponding observations are realizations of $Y_1, Y_2, \ldots, Y_N$. 
¹ HMM’s can handle real-valued observations as well. 
The transition matrix $A$ can be easily constructed from $\delta$, and the output matrix $B$ is 
completely deterministic, i.e., each row of $B$ has one 1 and $(|O| - 1)$ 0’s. 

Solving the State Inference Problem for $M$ and $y$ reduces to solving a similar state 
inference problem for $H$ and $y$. The State Inference Problem for a Hidden Markov Model 
$H$ given a trace $y = (y_1, y_2, \ldots, y_N)$ is as follows: 

State Inference Problem for Hidden Markov Models 

Let $H$ be a Hidden Markov Model. Generate an execution $q$ of $H$ and let $y$ be a trace 
of $q$. The State Inference Problem for Hidden Markov Models is to find $q$ given $H$ 
and $y$. 

The standard approach for finding $q$ in the State Inference Problem for HMM’s is 
maximum likelihood decoding. As we have seen before, naive implementations 
for finding 

$q = \arg\max_{s \in S^N} \Pr[Q = s \mid Y = y]$ 

will have running time exponential in $N$. However, the Viterbi algorithm [12] is a well- 
known dynamic programming solution to the State Inference Problem for HMM’s with 
running time $O(|S|^2 \cdot N)$. This addresses the first challenge, the need for efficient 
inference algorithms. 
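
A minimal sketch of the Viterbi recursion in Python follows. It is the textbook algorithm, not code from this paper; states and observations are represented by integer indices, the toy matrices are hypothetical, and plain probabilities are multiplied for readability (for long traces one would normally work with logarithms to avoid underflow).

```python
def viterbi(A, B, s0, trace):
    """Most likely hidden state sequence q_1..q_N for the given trace.

    A[i][j] = Pr[Q_n = j | Q_{n-1} = i], B[j][y] = Pr[Y_n = y | Q_n = j],
    s0 is the index of the initial state. Running time is O(|S|^2 * N).
    """
    M = len(A)
    # best[j]: probability of the most likely path that ends in state j.
    best = [A[s0][j] * B[j][trace[0]] for j in range(M)]
    back = []  # back[t][j]: predecessor of state j on that path
    for y in trace[1:]:
        prev, best, ptr = best, [], []
        for j in range(M):
            i = max(range(M), key=lambda i: prev[i] * A[i][j])
            ptr.append(i)
            best.append(prev[i] * A[i][j] * B[j][y])
        back.append(ptr)
    # Follow the back pointers from the most likely final state.
    state = max(range(M), key=lambda j: best[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-state model with noisy outputs.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
S0 = 0
TRACE = [0, 1, 1, 0, 1]
print(viterbi(A, B, S0, TRACE))
```

The two nested loops over states inside the loop over the trace give the $O(|S|^2 \cdot N)$ running time quoted above, in contrast to enumerating all $|S|^N$ candidate executions.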

HMM’s can also address the second problem of noisy side channel measurements. 
This can be handled by proper parameterization of the distribution $\Pr[Y_n \mid Q_n]$. For 
example, in the toy example presented in Section 2, for perfect observations, 
$\Pr[Y_n = AD \mid Q_n = s_2] = 1$. Noisy observations can be modeled by assuming observations 
are only probabilistically correct, e.g., $\Pr[Y_n = AD \mid Q_n = s_2] = 0.7$ and 
$\Pr[Y_n = D \mid Q_n = s_2] = 0.3$. 
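
As a small illustration, the output matrix $B$ for such a noisy channel can be built from the error-free assignment of one symbol per state. The helper name `noisy_output_matrix` and the choice to spread the error probability evenly over the wrong symbols are assumptions made for this sketch.

```python
def noisy_output_matrix(true_output, num_symbols, p_e):
    """Build B with B[i][j] = Pr[Y_n = o_j | Q_n = s_i].

    true_output[i] is the index of the symbol state s_i really emits.
    The correct symbol is observed with probability 1 - p_e; the
    remaining probability p_e is split evenly over the other symbols.
    Requires num_symbols >= 2.
    """
    wrong = p_e / (num_symbols - 1)
    return [[1.0 - p_e if j == out else wrong for j in range(num_symbols)]
            for out in true_output]

# Symbols: index 0 = D, index 1 = AD. Hypothetical machine in which the
# third state (s_2) emits AD, observed with error probability 0.3.
B = noisy_output_matrix([0, 1, 1], num_symbols=2, p_e=0.3)
for row in B:
    print(row)
```

With `p_e = 0` every row contains a single 1, which is the deterministic output matrix of the error-free case.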

4.2 Input Driven Hidden Markov Models 

HMM’s are not completely adequate for modeling most countermeasures. In most coun- 
termeasures, the next state depends not only on the current state, but also on the next bit 
of the key. In the context of HMM’s, we would like the key to serve as a sort of input 
to the HMM. However, HMM’s unfortunately do not have inputs. Therefore, we extend 
the notion of HMM’s to include the possibility of inputs by introducing Input Driven 
Hidden Markov Models (IDHMM’s). 

IDHMM’s extend HMM’s in two fundamental ways. First, the unknown input is 
treated as a random variable $K = (K_1, K_2, \ldots, K_N)$ such that $K_n$ is input to the 
underlying HMM at step $n$. The local conditional distributions are updated to reflect 
this, i.e., we replace $\Pr[Q_n \mid Q_{n-1}]$ with $\Pr[Q_n \mid Q_{n-1}, K_n]$. Second, since one of the 
motivations behind developing IDHMM’s was to analyze multiple traces, we need 
to add additional random variables to model additional execution/trace pairs. Thus, 
$Y^1, Y^2, \ldots, Y^L$ will represent a list of $L$ traces, where $Y^\ell = (Y^\ell_1, Y^\ell_2, \ldots, Y^\ell_N)$. Also, 
$Q^1, Q^2, \ldots, Q^L$ will represent the corresponding $L$ sequences of hidden states, where 
$Q^\ell = (Q^\ell_1, Q^\ell_2, \ldots, Q^\ell_N)$. IDHMM’s assume the same input is used in every execution, 
which corresponds exactly to the assumption that the same key is used in every execution 
of the countermeasure. 
Fig. 4. Input Driven Hidden Markov Models. This figure depicts one execution of an IDHMM on 
input $K_1, K_2, \ldots, K_N$. 



The graphical model shown in Figure 4 represents a single execution of an IDHMM. 
The input was applied a single time to produce a single output trace. Figure 5 shows 
a graphical model representing L traces from L executions of an IDHMM in which the 
same input is applied in each execution, for some L > 1. 

An Input Driven Hidden Markov Model is defined by the septuple 

$H = (S, I, O, A, B, C, s_0)$ 

where 

$S$ is a finite set of internal states, 

$I$ is a finite set of input symbols, 

$O$ is a finite set of symbols that represent operations observable over the side channel, 

$A$ is a $|S| \times |I| \times |S|$ matrix where $A_{ijk} = \Pr[Q_n = s_k \mid Q_{n-1} = s_i, K_n = i_j]$, 

$B$ is a $|S| \times |O|$ matrix where $B_{ij} = \Pr[Y_n = o_j \mid Q_n = s_i]$, 

$C$ is an $N \times |I|$ matrix representing the prior distributions for $(K_1, K_2, \ldots, K_N)$, 
where $C_{jk} = \Pr[K_j = i_k]$, 

$s_0 \in S$ is the initial state. 

Since in our setting the input is a binary key chosen uniformly at random, the set of input 
symbols is $I = \{0, 1\}$ and the prior distributions are $\Pr[K_n = 0] = \Pr[K_n = 1] = 0.5$. 

Our final goal is the inference problem for IDHMM’s: we want to infer the input key 
K rather than the sequences of hidden states Q. We define the Key Inference Problem 
for Input Driven Hidden Markov Models as follows: 

Key Inference Problem for Input Driven Hidden Markov Models 

Let $H$ be an Input Driven Hidden Markov Model. Generate an $N$ bit random input 
key $k$ and $L$ executions $q = (q^1, q^2, \ldots, q^L)$ of $H$ on $k$. Let $y = (y^1, y^2, \ldots, y^L)$ 
be the corresponding $L$ traces. The Key Inference Problem for Input Driven Hidden 
Markov Models is to find $k$ given $H$ and $y$. 
Fig. 5. Modeling Multiple Executions of an Input Driven Hidden Markov Model. This figure 
depicts $L$ executions of an IDHMM on input $K_1, K_2, \ldots, K_N$. 



Ideally, we would like to compute 

$\hat{k} = \arg\max_{k \in \{0,1\}^N} \Pr[K = k \mid Y = (y^1, y^2, \ldots, y^L)]$.    (*) 

However, we do not know how to compute this efficiently. Therefore, we introduce an 
approximation: we infer the posterior probabilities for each bit of the key separately, 
and then we use the most likely value of each bit to infer the entire key. This amounts to 
computing 

$\hat{k} = (k_N, k_{N-1}, \ldots, k_2, k_1)$, where $k_n = \arg\max_{b \in \{0,1\}} \Pr[K_n = b \mid Y = (y^1, y^2, \ldots, y^L)]$. 

However, even this was too hard for us. Our first attempts at an algorithm to calculate 
$\Pr[K_n \mid y]$ using dynamic programming in a manner similar to that in the inference 
algorithm for HMM’s encountered a significant problem: the resulting algorithm had 
running time exponential in L, the number of traces. Since our goal is to scale with the 
number of traces, this is unacceptable. 
SingleTraceInference($H$, $D$, $y'$): 

Input: An IDHMM $H$, a distribution $D$, assumed $\Pr[K = k] = D(k)$, and a trace 
$y' = (y'_1, y'_2, \ldots, y'_N)$ 
Output: A distribution $D'$, where $D'(k) = \Pr[K = k \mid Y' = y']$, using $D$ as our priors on $K$ 
1) Use a modified version of the Viterbi algorithm to compute $D'(k) := \Pr[K = k \mid Y' = y']$, 
assuming $D(k) = \Pr[K = k]$. Refer to the full version of this paper [17] for details. 

MultiTraceInference($H$, $y$): 

Input: An IDHMM $H$ and a set $y = (y^1, y^2, \ldots, y^L)$ of $L$ traces 
Output: $D_L$, an approximation to the distribution $\Pr[K = k \mid Y = y]$ 
1) Let $D_0$ := the uniform distribution on $\{0, 1\}^N$. 
2) for $i := 1, 2, \ldots, L$ do 
       $D_i$ := SingleTraceInference($H$, $D_{i-1}$, $y^i$) 
3) Output $D_L$. 

Infer($H$, $y$): 

Input: An IDHMM $H$ and a set $y = (y^1, y^2, \ldots, y^L)$ of $L$ traces 
Output: $k$, a guess at the key 
1) Let $\Pr[K_i \mid Y = y]$ be as given by MultiTraceInference($H$, $y$). 
2) for $i := 1, 2, \ldots, N$ do 
       if $\Pr[K_i = 1 \mid Y = y] > 0.5$ then $k_i := 1$ else $k_i := 0$ 
3) Output $k_{guess} = k_N k_{N-1} \ldots k_2 k_1$. 

Fig. 6. An approximate inference algorithm using belief propagation. Given a set of traces 
$y = (y^1, y^2, \ldots, y^L)$ of an Input Driven Hidden Markov Model $H$, we compute a guess 
$k_{guess}$ = Infer($H$, $y$) at the key. 



To deal with these challenges, we introduce a new technique based on belief propa- 
gation. The key idea is to separate L executions of an IDHMM on the same input into L 
executions of an IDHMM where there are no assumptions about the input used in each 
execution. In terms of the graphical models, this corresponds to transforming Figure 5 
into L copies of Figure 4. We can derive an efficient exact inference algorithm for a 
single execution of an IDHMM with running time $O(|S|^2 \cdot N)$. By applying this exact 
inference algorithm separately to each of the L executions, we obtain an algorithm with 
final running time of $O(|S|^2 \cdot N \cdot L)$. 

The problem with this approach is that we are not taking advantage of the fact that 
the executions all use the same key. Using L traces as input, this approach will output 
L separate inferences of the key, each derived independently of the others. We can link 
them by using belief propagation: instead of using the uniform distribution as our prior 
$\Pr[K_n]$ for each key bit in the L analyses, we use the posterior distribution $\Pr[K_n \mid y^\ell]$ 
calculated from analysis of the $\ell$-th trace as the prior distribution $\Pr[K_n]$ while analyzing 
the $(\ell + 1)$-st trace. Hence, we propagate any biases on the key bits learned from the $\ell$-th 
trace to the analysis of the $(\ell + 1)$-st trace. A detailed description of our belief propagation 
algorithm for inferring the secret key from L traces of an IDHMM is shown in Figure 6. 
Although the output of this algorithm is only an approximation to what we ideally want 
in (4.2), we have found that it works well in practice. 
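
The following Python sketch illustrates the prior-propagation idea on a toy IDHMM. It is not the authors' implementation: the single-trace step uses a standard forward-backward pass over the hidden states with the key bit summed out (our substitute for the modified Viterbi algorithm of the full paper), the priors are kept as one independent probability per key bit, and the three-state machine is hypothetical.

```python
import random

def single_trace_inference(A, B, s0, prior, trace):
    """Return Pr[K_n = 1 | trace] for every key bit position n.

    A[i][b][k] = Pr[Q_n = k | Q_{n-1} = i, K_n = b]
    B[k][o]    = Pr[Y_n = o | Q_n = k]
    prior[n]   = Pr[K_n = 1] before this trace is taken into account
    """
    M, N = len(A), len(trace)
    pk = [(1.0 - p, p) for p in prior]

    def step(n, i, b, k):
        # Weight of key bit b, transition i -> k and observation trace[n].
        return pk[n][b] * A[i][b][k] * B[k][trace[n]]

    # Forward pass: alpha[n] is the state distribution after n observations.
    alpha = [[1.0 if k == s0 else 0.0 for k in range(M)]]
    for n in range(N):
        row = [sum(alpha[n][i] * step(n, i, b, k)
                   for i in range(M) for b in (0, 1)) for k in range(M)]
        total = sum(row)
        alpha.append([v / total for v in row])

    # Backward pass: beta[n][i] weighs the observations after position n.
    beta = [[1.0] * M for _ in range(N + 1)]
    for n in range(N - 1, 0, -1):
        row = [sum(step(n, i, b, k) * beta[n + 1][k]
                   for b in (0, 1) for k in range(M)) for i in range(M)]
        total = sum(row)
        beta[n] = [v / total for v in row]

    posterior = []
    for n in range(N):
        w = [sum(alpha[n][i] * step(n, i, b, k) * beta[n + 1][k]
                 for i in range(M) for k in range(M)) for b in (0, 1)]
        posterior.append(w[1] / (w[0] + w[1]))
    return posterior

def multi_trace_inference(A, B, s0, traces):
    """Use the posterior from each trace as the prior for the next one."""
    belief = [0.5] * len(traces[0])
    for trace in traces:
        belief = single_trace_inference(A, B, s0, belief, trace)
    return belief

def infer(posteriors):
    return [1 if p > 0.5 else 0 for p in posteriors]

def simulate(A, B, s0, key, rng):
    """Run the toy countermeasure once and return the observed trace."""
    state, trace = s0, []
    for bit in key:
        state = rng.choices(range(len(A)), weights=A[state][bit])[0]
        trace.append(rng.choices(range(len(B[state])), weights=B[state])[0])
    return trace

# Hypothetical 3-state randomized machine; outputs 0 = D, 1 = AD, p_e = 0.1.
A = [[[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],
     [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],
     [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]]]
B = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]

rng = random.Random(2003)
key = [rng.randrange(2) for _ in range(16)]
traces = [simulate(A, B, 0, key, rng) for _ in range(50)]
guess = infer(multi_trace_inference(A, B, 0, traces))
print(sum(g == k for g, k in zip(guess, key)), "of", len(key), "bits match")
```

Each call to `single_trace_inference` costs time proportional to $|S|^2 \cdot N$ (times the two input symbols), so processing $L$ traces scales linearly in $L$, as in the running time stated above.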
5 Application to Randomized Addition-Subtraction Chains 

Oswald and Aigner have proposed two randomized exponentiation algorithms [2] for 
scalar multiplication in ECC implementations. These algorithms are based on the 
randomization of addition-subtraction chains. For example, instead of the usual binary 
decomposition $15P = 8P + 4P + 2P + P$, $15P$ can alternatively be calculated as 

$15P = 16P - P = 2(2(2(2(P)))) - P$. 

More generally, a series of more than two 1’s in the binary representation of $k$ can be 
replaced by a block of 0’s and a $-1$, i.e., $01^a \mapsto 10^{a-1}\bar{1}$, where $\bar{1}$ represents $-1$. A 
second transformation noted by Oswald and Aigner treats isolated 0’s inside a block of 
1’s, i.e., $01^a01^b \mapsto 10^a\bar{1}0^{b-1}\bar{1}$. 
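
Both rewriting rules preserve the value of the scalar, which can be checked numerically. In the Python sketch below the letter `T` stands for the digit $-1$; this encoding and the helper names are our own.

```python
def signed_value(digits):
    """Value of a signed-binary string, most significant digit first.

    '0' and '1' are ordinary bits; 'T' denotes the digit -1.
    """
    value = 0
    for d in digits:
        value = 2 * value + {"0": 0, "1": 1, "T": -1}[d]
    return value

def transform1(a):
    """Left and right side of 0 1^a -> 1 0^(a-1) T."""
    return "0" + "1" * a, "1" + "0" * (a - 1) + "T"

def transform2(a, b):
    """Left and right side of 0 1^a 0 1^b -> 1 0^a T 0^(b-1) T."""
    return ("0" + "1" * a + "0" + "1" * b,
            "1" + "0" * a + "T" + "0" * (b - 1) + "T")

# 15 = 1111 in binary, and also 16 - 1 = 1000T.
assert signed_value("1111") == signed_value("1000T") == 15

for a in range(1, 10):
    left, right = transform1(a)
    assert signed_value(left) == signed_value(right)
    for b in range(1, 10):
        left, right = transform2(a, b)
        assert signed_value(left) == signed_value(right)
print("both transformations preserve the scalar")
```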

Both of these transformations can be modeled by deterministic finite state machines. 
Oswald and Aigner construct two randomized exponentiation algorithms by introducing 
randomness into these state machines while still preserving the end semantics of the two 
transformations. At each step where a transformation may apply, we flip a coin to decide 
whether or not to apply that transformation. We refer to the randomized construction 
based on the transformation $01^a \mapsto 10^{a-1}\bar{1}$ as OA1 and the randomized construction 
based on the transformation $01^a01^b \mapsto 10^a\bar{1}0^{b-1}\bar{1}$ as OA2. The randomized state 
machine describing the operation of OA1 (as it appears in [2]) is shown in Figure 7(a). 

The randomized state machines in [2] that describe the operation of OA1 and OA2 do 
not conform to our definition of probabilistic state machines in Section 2, but this is easily 
remedied. The first hurdle is that traces cannot be parsed uniquely as words in $\{D, AD\}^*$. 
Although it would be convenient if our observable alphabet was $O = \{D, AD\}$, the 
transition from $s_2$ to $s_1$ executes a doubling first and then an addition, resulting in a DA 
output symbol corresponding to that key bit. This is undesirable because traces fail to be 
uniquely decodable: for example, DADD could be interpreted as either (DA, D, D) 
or (D, AD, D). We remedy this problem by interpreting the automaton in Figure 7(a) 
slightly differently. We relabel the DA transition from $s_2$ to $s_1$ to simply a D (i.e., 
$Q = 2Q$) and now associate the “owed” addition with each outgoing transition from 
state $s_1$. Our output alphabet becomes $O = \{D, AD, AAD\}$, and then each sequence 
of D and A operations can deterministically be decomposed into a sequence of symbols 
from $O$. The resulting state machine is shown in Figure 7(b). 
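
Because every symbol of the new alphabet ends with exactly one D, the decomposition can be done in a single left-to-right pass. The Python sketch below shows this; the function name is ours, and the treatment of a trace that ends with an unmatched addition (rejected here) is an assumption, since the text does not discuss that case.

```python
def parse_trace(ops):
    """Split a string of A/D operations into symbols from {D, AD, AAD}."""
    symbols, pending = [], ""
    for op in ops:
        if op not in ("A", "D"):
            raise ValueError(f"unknown operation {op!r}")
        pending += op
        if op == "D":  # every symbol is closed by a doubling
            if pending not in ("D", "AD", "AAD"):
                raise ValueError(f"{pending!r} is not in the alphabet")
            symbols.append(pending)
            pending = ""
    if pending:
        raise ValueError("trace ends with an addition that has no doubling")
    return symbols

print(parse_trace("DADD"))     # ['D', 'AD', 'D']
print(parse_trace("AADDAD"))   # ['AAD', 'D', 'AD']
```

With the original labelling, DADD had two readings; with the alphabet $\{D, AD, AAD\}$ only (D, AD, D) remains.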

A second hurdle is that Oswald and Aigner place observable operations on the edges, 
rather than on the states. Fortunately, edge-annotated state machines can easily be 
transformed into a semantically equivalent state-annotated machine (of the type defined in 
Section 2) by treating each edge in Figure 7(b) as a state in the probabilistic FSA. This 
yields a faithful probabilistic finite state machine to which our algorithms can be applied. 
See Figure 7(c) for the result of this process applied to OA1. 
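
The edge-to-state conversion can be sketched as follows. The edge list below is a hypothetical two-vertex machine, not the OA1 automaton of Figure 7, and the extra silent start state is a convenience we introduce so that the converted machine has a well-defined initial state.

```python
def edges_to_states(edges, start):
    """Turn an edge-annotated machine into a state-annotated one.

    edges: list of (source, key_bit, probability, output, destination).
    Returns (A, outputs). New state 0 is a silent start state and new
    state e + 1 stands for edges[e]. A[i][b][k] is the probability of
    moving from new state i to new state k on key bit b, and outputs[k]
    is the symbol emitted in new state k.
    """
    n = len(edges) + 1
    A = [[[0.0] * n for _ in (0, 1)] for _ in range(n)]
    for i in range(n):
        # Vertex of the original machine that new state i has arrived at.
        vertex = start if i == 0 else edges[i - 1][4]
        for e, (src, bit, prob, _out, _dst) in enumerate(edges):
            if src == vertex:
                A[i][bit][e + 1] = prob
    outputs = [None] + [edge[3] for edge in edges]
    return A, outputs

# Hypothetical edge-annotated machine with vertices u and v.
EDGES = [("u", 0, 1.0, "D", "u"),
         ("u", 1, 0.5, "AD", "u"),
         ("u", 1, 0.5, "D", "v"),
         ("v", 0, 1.0, "AD", "u"),
         ("v", 1, 1.0, "D", "v")]

A, outputs = edges_to_states(EDGES, start="u")
print(outputs)
print(A[3][0])  # transitions out of the state for edge u -> v on key bit 0
```

Each new state has exactly one output symbol, so the result fits the state-annotated form assumed by the inference algorithms.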

Fig. 7. The first Oswald-Aigner construction (OA1) 

Once we have probabilistic finite state machine representations for the countermeasures, 
applying our techniques is straightforward. We simulated the operation of both 
exponentiation algorithms in software. First, we generated a random 192 bit key $k$. Using 
$k$, we then generated a set of traces $y = (y^1, y^2, \ldots, y^L)$. We introduced errors in the 
traces consistent with observation error $p_e$. With probability $1 - p_e$, each output symbol 
is observed correctly, and with probability $p_e$, it is changed to some other output symbol 
(chosen randomly). We assumed the error probability $p_e$ is known to the attacker, 
and we incorporated it into the output distribution (i.e., $\Pr[Y_n \mid Q_n]$) of the resulting 
IDHMM. Treating the OA1 or OA2 countermeasure as an IDHMM driven by $k$, we then 
applied the Infer algorithm from Figure 6 to compute 

$k_{guess} = (k_N, k_{N-1}, \ldots, k_2, k_1)$, where 
$k_n = \arg\max_{b \in \{0,1\}} \Pr[K_n = b \mid Y = (y^1, y^2, \ldots, y^L)]$. 

The following table summarizes the results of our attacks against OA1 and OA2: 

                        Number of key bits correctly recovered 

                                 Number of traces used 
Countermeasure   p_e       1     5    10    25    50   100   500 
OA1              0       170   187   192   192   192   192   192 
OA1              0.1     157   178   184   185   187   192   192 
OA1              0.25    143   163   173   180   182   183   184 
OA1              0.4     120   147   159   168   172   173   174 
OA2              0       165   188   192   192   192   192   192 
OA2              0.1     156   174   184   187   189   192   192 
OA2              0.25    135   161   174   177   180   181   182 
OA2              0.4     126   146   154   168   171   172   173 

Each entry in the table specifies the number of key bits (out of 192) that we correctly 
recovered using the corresponding number of traces and given observation error $p_e$. 

Both OA1 and OA2 are clearly insecure under our assumptions in Section 3. With a 
perfect side channel ($p_e = 0$), we recovered the entire secret key perfectly with as few 
as 10 traces; also, our techniques remain effective in the presence of noise. 

One can also reduce the number of traces by combining our attack with a semi- 
exhaustive attack over the most likely key candidates. It suffices to recover 182 of the 
192 key bits correctly, on average; then we can apply a meet-in-the-middle search over all 
possible 10-bit error patterns to identify the correct private key, using 2^® work. Hence, 
with a 0.1 probability of observation error, the entire key can be recovered with only 10 
traces, and for a 0.25 error probability, the data complexity increases to 50-500 traces. 



6 Related Work 

Several authors have analyzed the security of selected randomized countermeasures 
against side channel attacks [13,14,16], including the Oswald-Aigner constructions. 
However, the analysis techniques previously used have been ad hoc in the sense that they 
are tailored specifically to the countermeasure being analyzed, and it is not clear how to 
generalize them to analyze other randomized countermeasures (if this is even possible). 
In contrast, our techniques are broadly applicable to randomized countermeasures whose 
operation can be modeled by a probabilistic finite state machine. 

Based on a comprehensive case analysis, Okeya and Sakurai [13] present an attack 
against OA1 that with high probability recovers a 192 bit key using approximately 292 
traces of the side channel. They assume the ability to perfectly distinguish between 
elliptic curve point additions and doublings and do not consider the case when the side 
channel is noisy. 

C.D. Walter [14] presents an attack against OA2 based on a detailed analysis of its 
operation. With high probability, his attack recovers a 192 bit key using O(10) traces 
of the side channel. This attack can be generalized to work against OA1 as well. Then, 
Walter also discusses how to partition traces into smaller subsections and exhaustively 
search (independently) for the key corresponding to each subsection. Depending on the 
key size, it is possible for his second technique to succeed with as few as two traces. 
Both his attacks assume the ability to perfectly distinguish between elliptic curve point 
additions and doublings. 

Song et al. [18] use Hidden Markov Models to exploit weaknesses of the widely used 
SSH protocol. By observing the inter-keystroke timings of a user’s key presses during 
an SSH session, the authors are able to recover significant information about the 
keystroke sequences. They use this technique to speed up exhaustive search for passwords 
by a factor of 50. Other than the work of Song et al., we are not aware of any previous 
work that uses HMM’s for side channel cryptanalysis. 



7 Conclusion 

We introduced HMM attacks, a general-purpose cryptanalysis technique for evaluating 
the security properties of randomized countermeasures whose operation can be modeled 
by a probabilistic finite state machine. We also introduced Input Driven Hidden Markov 
Models, an extension of HMM’s that models inputs, and we presented efficient approximate 
inference algorithms for recovering the input to an IDHMM given multiple output 
traces. Our work improves on existing attacks against randomized countermeasures in 
two fundamental ways. Firstly, previous attacks against randomized countermeasures 
typically consist of detailed case analyses, and it is not clear how to generalize them to 
attacks on larger classes of countermeasures. We present a cryptanalytical framework 
applicable to a general class of randomized countermeasures. Secondly, previous attacks 
against randomized countermeasures assume the ability to perfectly distinguish between 
operations in the side channel. Our techniques are still applicable if the side channel is 
noisy. 

We demonstrate the application of HMM attacks and IDHMM’s in an analysis of 
Randomized Addition-Subtraction Chains proposed by Oswald and Aigner. When our 
observations of the side channel are perfect, we are able to completely recover the secret 
key using as few as 5-10 traces. Our attacks are robust to noise in the side channel as 
well. For instance, when the probability of each observation being incorrect is 0.25, we 
are still able to recover the secret key by using 50-500 traces. 
Abstract. Field Programmable Gate Arrays (FPGAs) are becoming 
increasingly popular, especially for rapid prototyping. For implementa- 
tions of cryptographic algorithms, not only the speed and the size of 
the circuit are important, but also their security against implementation 
attacks such as side-channel attacks. Power-analysis attacks are typical 
examples of side-channel attacks that have been demonstrated to be 
effective against implementations without special countermeasures. The 
flexibility of FPGAs is an important advantage in real applications but 
also in lab environments. It is therefore natural to use FPGAs to assess 
the vulnerability of hardware implementations to power-analysis attacks. 
To our knowledge, this paper is the first to describe a setup to conduct 
power-analysis attacks on FPGAs. We discuss the design of our 
hand-made FPGA-board and we provide a first characterization of the 
power consumption of a Virtex 800 FPGA. Finally we provide strong 
evidence that implementations of elliptic curve cryptosystems without 
specific countermeasures are indeed vulnerable to simple power-analysis 
attacks. 

Keywords: FPGA, Power Analysis, Elliptic Curve Cryptosystems 



1 Introduction 

Since their publication in 1998, power-analysis attacks have attracted significant 
attention within the cryptographic community. So far, they have been success- 
fully applied to different kinds of (unprotected) implementations of symmetric 
and public-key encryption schemes and on digital signature schemes. Most at- 
tacks which have been published in the open literature apply to software imple- 
mentations of cryptographic algorithms which can be found for example in smart 
cards (see [KJJ99], [MDS99a] or [MDS99b]). However, modern smart cards and 
accelerators for cryptographic algorithms also contain hardware implementations 
of cryptographic algorithms. 

As part of a modern design flow, FPGAs are gaining more importance. Rea- 
sons for this include their relatively low cost and the available tools. High-level 
descriptions (in VHDL, for example) for a circuit can easily be ported, if not 
directly used, for an FPGA implementation of the circuit. Naturally, it is de- 
sirable to use the resulting FPGA implementation also for an evaluation of the 
designed circuit against power-analysis attacks. 

This article describes the first realization of power-analysis attacks on a Vir- 
tex FPGA. We can prove that this FPGA leaks a significant amount of infor- 
mation about its internal computations through the supply lines. We can even 
provide evidence that the power consumption characteristics are comparable 
with the power consumption characteristics of ordinary ASICs. To demonstrate 
how dramatic the power consumption leakage of this FPGA is, we finally per- 
form a simple power-analysis attack on an implementation of an elliptic-curve 
point-multiplication. 

The remainder of this article is organized as follows. We recall the princi- 
ples of power-analysis attacks in Sect. 2. FPGAs are introduced in Sect. 3. For 
the purpose of conducting power-analysis attacks, we built a special measure- 
ment board. This measurement setup is described in Sect. 4. The results of our 
experiments can be found in Sect. 5. Section 6 presents the conclusion of our 
research. 



1.1 Related Work 

The characterization of the power-consumption characteristics of FPGAs has 
received little attention so far. Shang et al. [SKB02] is the only recent article in 
that field. In their article, Shang et al. analyze the dynamic power consumption 
of the XILINX Virtex-II family. They conclude that 60% of the dynamic power 
consumption is due to the interconnects, 14% is due to the clocking, 16% is due 
to the logic and 10% is due to the IOBs. Based on this result, it seems much more 
difficult to conduct power-analysis attacks on FPGAs than on ASICs. However, 
as we will demonstrate in this article, such attacks are feasible and can be realized 
in practice. 



2 Power-Analysis Attacks 

Power-analysis attacks are a very powerful type of side-channel attack, published 
first by Kocher et al. [KJJ99]. Power-analysis attacks are passive in the sense that 
an attacker only needs to measure the power consumption of a device without 
manipulating it actively, that is, an attacker uses the device in its intended 
mode of use. A likely scenario (in the case of attacks on smart cards) is that an 
attacker lets the device execute an internal authenticate command. While the 
device is executing this command, the attacker measures the power consumed 
by the device. Statistical methods then make it possible to efficiently extract the 
information on the secret key that is contained in the measurements. 

2.1 Power Consumption Characteristics of CMOS 

Nowadays, almost all smart card processors are implemented in CMOS (Complementary 
Metal-Oxide-Semiconductor) technology. In CMOS technology, the values 0 and 
1 are represented by Vss and Vdd, respectively. The dominating factor for the 
power consumption of a CMOS gate is the dynamic power consumption [WE93] . 
Transition count leakage and Hamming weight leakage can typically be observed 
in CMOS circuits, see [MDS99b] for a detailed explanation. 

The power consumption behaviour of a CMOS processor can be roughly 
sketched as follows. On every rising edge of the clock, the simultaneous switching 
of the gates causes a current flow which is visible through both Vdd and Vss. This 
current flow can be observed on the outside of the device by (for example) putting 
a small resistor between the device’s Vss or Vdd and the true Vss or Vdd. The current 
flowing through the resistor creates a voltage which can be measured by a digital 
oscilloscope. 



2.2 Exploiting the (Hidden) Information 

Depending on how directly the information about the power consumption can be 
used, simple or differential power-analysis attacks (SPA or DPA) [KJJ99] have 
to be applied. SPA attacks are always possible when the power consumption is 
more or less directly related to the actions of the secret key. This is mostly the 
case when the instructions executed in the device give evidence about the secret 
key. If the instructions do not provide such information, but the processed data 
do so instead, then the information is typically more hidden in the overall power 
consumption and thus statistical methods have to be applied to bring it to 
light. This is the approach taken for differential power-analysis attacks. 



Simple Power-Analysis Attacks. Simple power-analysis attacks exploit the 
relationship between the instantaneous power consumption of a device and the 
instructions that are executed. For simple power-analysis attacks it is assumed 
that every instruction has its unique power-consumption trace. An attacker sim- 
ply monitors the device’s power consumption while it performs a cryptographic 
operation. Then, the attacker carefully studies the obtained power-consumption 
trace to determine the sequence of instructions performed by the device. If this 
sequence is directly related to the secret key which was involved in the cryp- 
tographic operation, the attacker can deduce this secret key from the power- 
consumption trace. Such an attack typically targets implementations which use 
key dependent branching in the implementation. 
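
A standard example of key-dependent branching is left-to-right double-and-add scalar multiplication, where an addition is executed only for key bits equal to 1. The Python sketch below assumes an idealized side channel that lets the attacker read off each doubling (D) and addition (A) without error; the function names are ours.

```python
def double_and_add_trace(k):
    """Operation sequence of left-to-right double-and-add for scalar k > 0."""
    bits = bin(k)[2:]
    ops = []
    for bit in bits[1:]:  # the leading 1 only initializes the accumulator
        ops.append("D")   # every remaining bit costs one doubling ...
        if bit == "1":
            ops.append("A")  # ... and an addition only if the bit is 1
    return "".join(ops)

def recover_scalar(ops):
    """Read the scalar back from an error-free operation sequence."""
    bits, i = "1", 0
    while i < len(ops):
        if ops[i] != "D":
            raise ValueError("each key bit must start with a doubling")
        if i + 1 < len(ops) and ops[i + 1] == "A":
            bits += "1"
            i += 2
        else:
            bits += "0"
            i += 1
    return int(bits, 2)

secret = 0b110100111
trace = double_and_add_trace(secret)
print(trace)
print(recover_scalar(trace) == secret)
```

A single trace suffices here because the operation sequence determines the key uniquely, which is why unprotected implementations of this kind are the typical target of simple power analysis.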
Differential Power-Analysis Attacks. An attacker faces now the task of 
exploiting the hidden information about the secret key in an efficient way. For 
this purpose, the attacker creates a hypothetical model of the device. This hy- 
pothetical model describes, at a very abstract level, the instantaneous power 
consumption of the device when it executes a certain cryptographic algorithm. 
For this purpose, at least a small part of the unknown key has to be guessed. 
Fortunately (for the attacker), all algorithms deployed in practice use only small 
parts of the secret key at a time. 

The attacker writes a simple computer program that executes the algorithm 
(or at least a small part of it, where a part of the key is used). The program 
calculates the result (of this part) for all possible key values. These values allow 
the attacker to predict the power consumption, which is, for example, related to 
the Hamming weight of the internal data. 

In the last stage of the attack, an attacker feeds the same input values which 
he used in the model to the real device and measures its power consumption. 
Then the attacker correlates the predictions of the model with the real power 
consumption values. For all the wrong key guesses, the predictions will not cor- 
relate with the real measurements, but for the correct key guess, there will be a 
peak visible in the correlation trace. 
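
The three stages described above can be illustrated with a minimal Python sketch. It is not the attack code used for the experiments in this paper: the randomly generated substitution table, the simulated device with Gaussian noise, and all function names are assumptions made purely for illustration. The sketch predicts the Hamming weight of an intermediate value for every guess of one key byte and selects the guess whose predictions correlate best with the (simulated) measurements.

```python
import random


def hamming_weight(value):
    """Number of set bits; this is the assumed power model."""
    return bin(value).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Hypothetical nonlinear table standing in for "a small part of the algorithm".
SBOX = list(range(256))
rng.shuffle(SBOX)


def intermediate(data, key_guess):
    """Internal value that depends on one input byte and one key byte."""
    return SBOX[data ^ key_guess]


def simulate_measurements(inputs, secret_key, noise=1.0):
    """Stand-in for the real device: Hamming weight plus Gaussian noise."""
    return [hamming_weight(intermediate(d, secret_key)) + rng.gauss(0.0, noise)
            for d in inputs]


def recover_key_byte(inputs, measurements):
    """Return the key guess whose predictions correlate best with the data."""
    scores = []
    for guess in range(256):
        predictions = [hamming_weight(intermediate(d, guess)) for d in inputs]
        scores.append(pearson(predictions, measurements))
    return max(range(256), key=scores.__getitem__)


inputs = [rng.randrange(256) for _ in range(500)]
traces = simulate_measurements(inputs, secret_key=0x3C)
recovered = recover_key_byte(inputs, traces)
```

In a real attack each measurement is a whole trace rather than a single sample, and the correlation is computed per time instant; the peak mentioned above then appears at the instants where the intermediate value is processed.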

3 Field Programmable Logic Arrays 

An FPGA consists of an array of configurable logic blocks (CLBs), surrounded by 
programmable I/O blocks, and connected with programmable interconnections 
as shown in Fig. 1 [Opt]. A typical FPGA contains from 64 to tens of thousands 
of logic blocks and an even greater number of flip-flops. Most FPGAs do not 
provide a 100% interconnect between the logic blocks. Instead, sophisticated 
software places and routes the logic on the device. 

Fig. 1. The FPGA architecture 

Two main classes of FPGA architectures can be distinguished. Coarse- 
grained architectures consist of fairly large logic blocks, often containing two 
or more look-up tables and two or more flip-flops. Fine-grained architectures 
consist of a large number of relatively simple logic blocks. Another difference 
in the architectures is the underlying process technology used to manufacture 
the device. Currently, the highest-density FPGAs are built using static memory 
(SRAM) technology, similar to that used for microprocessors. The other common 
process technology is called anti-fuse, which has the benefit of more plentiful 
programmable interconnect. 

SRAM-based devices are inherently re-programmable, even in-system. After 
power is applied to the circuit, the program data defining the logic 
configuration must be loaded into the SRAM [MK01]. The FPGA either self-loads 
its configuration memory, or an external processor downloads the configuration 
into the FPGA. The configuration time is typically less than 200 ms, depending 
on the device size and configuration method. In contrast, anti-fuse devices are 
one-time programmable (OTP). Once programmed, they cannot be modified, 
but they also retain their program when the power is off. Anti-fuse devices are 
programmed in a device programmer either by the end user or by the factory 
or distributor. More details on the Xilinx Virtex Architecture are provided in 
Appendix A. 



4 The Measurement Setup 

Our setup consists essentially of two boards (see Fig. 2). The main board is 
responsible for interfacing with the PC via the parallel port. It is connected to 
the XILINX parallel cable in order to program the VIRTEX FPGA, and it provides 
some LEDs, switches and buttons for testing purposes. The daughter board 
just carries the VIRTEX FPGA; it gives access to some pins for triggering and 
makes it convenient to measure the power consumption of the VIRTEX FPGA. 



4.1 The Mother Board 

The parallel port [Axe00] is the most commonly used port for interfacing 
home-made projects. This port allows the input of 5 bits and the output of 12 bits. 
The port is composed of 4 control lines, 5 status lines and 8 data lines. The 
communication between the FPGA and the PC uses this parallel port. We need only 
17 input/output pins to send data or commands to the FPGA and receive the 
result, but we designed the board in such a way that it gives us more monitoring 
points and thus connected 32 input/output pins of the FPGA to the board. The 
unused input/output pins are pulled up after the configuration. 

Fig. 2. The measurement setup. On the daughter board the current probe is connected 
to VCCINT. Alternatively it can be connected to the VCCO of the individual banks, 
or the GND. 

We also designed a protocol to send and receive data to and from the FPGA. 
When the FPGA communicates with the PC, it uses the three most significant 
bits of the status lines to indicate its status. The two remaining bits of the status 
lines are used for sending the result from the FPGA to the PC. The protocol 
is independent of the operation executed in the FPGA. Only the length of 
the data which is communicated can be modified by the PC. This provides a 
flexible setup where experiments with different algorithms can be performed in 
a coherent manner. 
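
The use of the five status lines can be sketched as follows. This is only an illustration under stated assumptions: the text fixes that the three most significant status bits carry the FPGA's status and the two remaining bits carry result data, but the function names, the meaning of the state codes, and the least-significant-chunk-first ordering of the result are hypothetical.

```python
def split_status(status):
    """Split a 5-bit status-line value into (state, payload).

    state   : the three most significant bits (status of the FPGA)
    payload : the two remaining bits (a piece of the result)
    """
    if not 0 <= status < 32:
        raise ValueError("status must fit in 5 bits")
    return status >> 2, status & 0b11


def assemble_result(status_words, n_bits):
    """Rebuild an n_bits-wide result from successive 2-bit payloads.

    The payload order (least significant chunk first) is an assumption.
    """
    value = 0
    for index, word in enumerate(status_words):
        _, payload = split_status(word)
        value |= payload << (2 * index)
    return value & ((1 << n_bits) - 1)
```

Because only the number of transferred chunks depends on the operand length, the same routine can serve different algorithms, which matches the flexibility described above.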



4.2 The Daughter Board 

We use a Xilinx XCV800 FPGA from the Virtex series in a HQ240C package. 
Reasons for this particular choice include: 

1. The resources are sufficient to implement a 160-bit elliptic-curve point- 
multiplication. 

2. This is the most powerful FPGA that can be used for hand-mounting on 
the board. This is because the pins of this FPGA are on its sides. The more 
powerful FPGAs have the pins underneath with a grid structure and so 
special machines are needed to mount them. 

3. The architecture is made of combinational and memory elements. Because 
of this property it is a good representative of application specific integrated 
circuits (ASICs). 

The XCV800 has 12 core voltage supply (VCCINT) pins, 16 output voltage 
supply (VCCO) pins and 32 ground (GND) pins. The FPGA is divided into 8 
banks, each with its own VCCINT and VCCO pins. After the implementation 
of the desired circuit and the configuration of the FPGA with the implementation 
data, some banks will be used more frequently than others; these banks should 
draw more current from their supply lines. With our setup it is possible to verify 
this hypothesis. If different parts of a design (such as an elliptic-curve 
addition and an elliptic-curve doubling) are mapped to different banks of the 
FPGA, measuring the current of the individual banks allows us to take more 
precise measurements for them. By measuring VCCINT and VCCO of the same 
bank separately, we can observe the timing and power consumption of the 
input/output activity and of the core activity separately. 

Therefore we use three headers with two lines for VCCINT, VCCO and 
GND as shown in Fig. 2. During the normal operation of the board without 
measurement the two pins are connected by a jumper. When we want to measure 
the current flow from a specific bank, the associated jumper is replaced by a cable 
that is going through the hole in the current probe as shown in Fig. 2. 

This setup gives the possibility of making measurements on different points 
at the same time and makes it easy to modify the measurement point. 



Bypassing Considerations. With high-speed, high-density FPGA devices, 
maintaining signal integrity is the key to reliable, repeatable designs [Xil02]. 
Proper power bypassing and decoupling improves the overall signal integrity. 
Without it, power and ground voltages are affected by logic transitions and can 
cause operational issues. 

When a logic device switches from a logic one to a logic zero, or a logic 
zero to a logic one, the output structure is momentarily at a low impedance 
across the power supply. Each transition requires that a signal line be charged 
or discharged, which requires energy. As a result, many electrons are suddenly 
needed to keep the voltage from collapsing. The function of the bypass capacitor 
is to provide local energy storage. 

10 nF capacitors are placed between every VCCINT and VCCO pin of the 
FPGA and the nearest GND. Because we split the setup across two 
cards, the daughter card can be thought of as a stand-alone chip taking power 
from the mother board. Bypass capacitors therefore had to be placed between the 
power supplies of the card and the GND line. 

5 Results 

We now describe the experiments conducted on the measurement setup described 
above. As the aim of our work was to build an alternative platform for power- 
analysis attacks, we decided first to perform some basic experiments to verify 
that the assumptions which we usually make for such attacks are also valid for 
our FPGA setup. 

5.1 Power Consumption Characteristics 

As discussed in Sect. 2, we should be able to detect either transition-count 
leakage or Hamming-weight leakage in our setup. This is because, according to 
Sect. 3, the CLBs consist of flip-flops (and other logic) which exhibit the 
power-consumption characteristics of CMOS technology. The only potential problem 
is that if the circuit which we load into the FPGA does not use all of the FPGA's 
resources, then the noise produced by the unused parts might be larger than the 
signal produced by the circuit. 




Fig. 3. Comparison between an idle bank, which corresponds to the white trace in this 
picture, and a bank which receives data (8 bits) and processes it, which corresponds 
to the dark trace in this picture. 



To evaluate the behavior of the FPGA we loaded a small circuit on one of the 
banks of the FPGA. Then, we measured the power consumption of the whole 
FPGA and, at the same time, the power consumption of an empty (idle) bank 
(see Fig. 3 for the power consumption traces). The overall power consumption 
(the dark trace) clearly shows peaks when data is transmitted and when the 
data is processed. The light trace, however, does not exhibit any peaks during 
the whole computation. This experiment confirms that idle parts of the FPGA 
will not influence the overall power consumption. Moreover, even the power 
consumption of a very small circuit (we used only 3% of the FPGA and of the 
FPGA's flip-flops) can be easily detected. 

With another simple set of experiments we confirmed that the amount of 
power consumed by the FPGA is linear in the number of switched flip-flops. We 
designed registers of a specific size and loaded them onto the FPGA. Then 
we let them repeatedly store the values 0 and 1 and measured the FPGA’s power 
consumption. Figs. 4 and 5 illustrate that the power consumed by the 6000-bit 
register for storing all 1s is about twice as high as the power consumed by the 
3000-bit register. 

A direct conclusion from such experiments is that the power consumption 
characteristics are essentially the same as those of an ordinary CMOS circuit. Idle 
CLBs or even idle banks do not add too much noise to the overall power 
consumption. 




Power-Analysis Attacks on an FPGA - First Experimental Results 







Fig. 4. Power consumption trace of a 3000-bit register. 

Fig. 5. Power consumption trace of a 6000-bit register. 



5.2 Attacking an Implementation of an Elliptic-Curve 
Point-Multiplication 

With the experience gained from these experiments, we attacked an implemen- 
tation of an EC point multiplication. We have implemented the arithmetic for 
a 160-bit prime field with a Montgomery modular multiplier (MMM) without 
final subtraction ([Mon85],[OBPV03], see Algorithm 1 for a description). 



Algorithm 1 Montgomery modular multiplication without final subtraction 

Require: Integers $N = (n_{l-1} \cdots n_1 n_0)_2$, $x = (x_l \cdots x_1 x_0)_2$, $y = (y_l \cdots y_1 y_0)_2$ with 
         $x \in [0, 2N-1]$, $y \in [0, 2N-1]$, $R = 2^{l+2}$, $\gcd(N, 2) = 1$ and $N' = -N^{-1} \bmod 2$ 
         (notation: $t_0$ denotes the least significant bit of $T$) 
Ensure: $x y R^{-1} \bmod 2N$ 
1: $T \leftarrow 0$ 
2: for $i$ from 0 to $l + 1$ do 
3:     $m_i \leftarrow (t_0 + x_i y_0) N' \bmod 2$ 
4:     $T \leftarrow (T + x_i y + m_i N) / 2$ 
5: end for 
6: Return ($T$) 





To obtain a linear, pipelined modular multiplier, the systolic array shown in 
Fig. 6 is used [Wal99]. X(0) denotes the least significant bit (LSB) of the register 
in which the input x is stored. T denotes the intermediate value register. The 
carry chain is stored in the C0 and C1 registers. The Montgomery modular 
multiplication circuit (MMMC) consists of a controller and a data path. The 
data path consists of a systolic array, four internal registers, a counter and a 
comparator. 

Fig. 6. Schematic view of the complete systolic array 




Fig. 7. Power consumption trace of 480-bit Montgomery modular multiplier from 
VCCINT (horizontal axis: clock cycle) 



The measurement of a 480-bit MMMC is depicted in Fig. 7. The three parts 
shown in the figure can be explained according to the algorithm and architecture 
used. The T register in Fig. 6 is reset at the beginning of the MMM operation 
and is then written. The number of bits in T which are updated 
increases until clock cycle $l$. This stage corresponds to the first part shown in the 
power consumption trace. After $l$ clock cycles all the bits of the T register have 
a value and all of them are updated before clock cycle $2l$. This stage is shown 
by the second part in Fig. 7. The last part in Fig. 7 corresponds to reading out 
the result from the pipeline. Because there is no new input on the LSB of the 
systolic array, starting from clock cycle $2l + 1$ the number of MSBs of the T 
register that are updated decreases. 
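
The arithmetic performed by the measured circuit can be sketched in software. The following Python function is a plain, sequential rendering of Algorithm 1 under our reading of it (the intermediate value is updated with $m_i N$ in step 4, and $N' = 1$ because $N$ is odd); it says nothing about the systolic-array timing, and the function name is an assumption made for illustration.

```python
def montgomery_no_final_subtraction(x, y, n):
    """Bit-serial Montgomery multiplication without final subtraction.

    For odd n with l bits and 0 <= x, y < 2n, returns T with
    T * 2**(l + 2) congruent to x * y modulo n and 0 <= T < 2n.
    """
    if n % 2 == 0:
        raise ValueError("the modulus must be odd")
    if not (0 <= x < 2 * n and 0 <= y < 2 * n):
        raise ValueError("operands must lie in [0, 2n - 1]")
    l = n.bit_length()
    t = 0
    for i in range(l + 2):                 # i = 0, ..., l + 1
        x_i = (x >> i) & 1
        # N' = -N^{-1} mod 2 equals 1 for odd N, so m_i is just a parity bit.
        m_i = (t + x_i * y) & 1
        t = (t + x_i * y + m_i * n) >> 1   # the sum is even by construction
    return t
```

Because the result stays below $2N$, it can be fed directly into the next multiplication, which is what makes the final subtraction unnecessary in a chain of multiplications.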



5.3 Elliptic Curve Point Addition and Doubling 

For the representation of the points on the elliptic curve we use modified 
Jacobian coordinates as proposed by Cohen et al. in [CMO98]. These points are 
represented as quadruples $(X, Y, Z, aZ^4)$. When we convert the input point P 
from affine coordinates to projective coordinates we take Z as 1. Because there 
are both MMMC and modular addition/subtraction (MAS) circuits available, 
these operations can be executed in parallel. When an EC point addition is used 
in an EC point multiplication, one of the inputs of the EC point addition circuit 
is always the input point P. Algorithms 2(a) and 2(b) describe the point 
addition and the point doubling operation, respectively. 



Algorithm 2 EC point addition and doubling 

Require: $P_1 = (x, y, 1, a)$, $P_2 = (X_2, Y_2, Z_2, aZ_2^4)$ 
Ensure: $P_1 + P_2 = P_3 = (X_3, Y_3, Z_3, aZ_3^4)$ 





The multiplications and the squarings use MMMC, while the additions, dou- 
blings and subtractions employ the modular addition/subtraction circuit. The 
power consumption trace of one 160-bit EC point addition is shown in Fig. 9. 
Fourteen states can be counted easily from the trace. All the states are 
completed in nearly 500 clock cycles. The power consumption during Steps 3, 5, 7, 8, 
10, 11 and 13 seems higher than for the other steps. In these steps a modular 
addition or subtraction is taking place as well as an MMM. 
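
For orientation, the field operations behind point addition and doubling in modified Jacobian coordinates can be sketched in Python. The formulas below are the textbook ones for the representation $(X, Y, Z, aZ^4)$ with one affine input; they are not the fourteen-state schedule of our circuit, and the toy curve $y^2 = x^3 + 2x + 3$ over GF(97), the function names and the missing handling of special cases are assumptions made for illustration.

```python
P_MOD = 97   # toy prime field, for illustration only
A = 2        # curve y^2 = x^3 + 2x + 3 over GF(97)


def ec_double(pt, p=P_MOD):
    """Double a point given in modified Jacobian coordinates (X, Y, Z, aZ^4)."""
    X, Y, Z, aZ4 = pt
    S = 4 * X * Y * Y % p
    U = 8 * pow(Y, 4, p) % p
    M = (3 * X * X + aZ4) % p          # aZ^4 is available without extra work
    X3 = (M * M - 2 * S) % p
    Y3 = (M * (S - X3) - U) % p
    Z3 = 2 * Y * Z % p
    return X3, Y3, Z3, 2 * U * aZ4 % p


def ec_add_mixed(affine_pt, pt, a=A, p=P_MOD):
    """Add the affine point (x, y) to a point in modified Jacobian coordinates."""
    x, y = affine_pt
    X2, Y2, Z2, _ = pt
    Z2_sq = Z2 * Z2 % p
    U1 = x * Z2_sq % p
    S1 = y * Z2_sq * Z2 % p
    H = (U1 - X2) % p
    r = (S1 - Y2) % p
    if H == 0:
        raise ValueError("equal or opposite points are not handled in this sketch")
    H2 = H * H % p
    H3 = H2 * H % p
    X3 = (r * r - H3 - 2 * X2 * H2) % p
    Y3 = (r * (X2 * H2 - X3) - Y2 * H3) % p
    Z3 = Z2 * H % p
    return X3, Y3, Z3, a * pow(Z3, 4, p) % p


def to_affine(pt, p=P_MOD):
    """Convert (X, Y, Z, aZ^4) to affine (x, y) = (X / Z^2, Y / Z^3)."""
    X, Y, Z, _ = pt
    z_inv = pow(Z, p - 2, p)           # inverse via Fermat's little theorem
    return X * z_inv ** 2 % p, Y * z_inv ** 3 % p
```

Every product and square in these functions corresponds to one MMMC operation, and every sum, difference or doubling to one MAS operation, which is why the two kinds of steps can be told apart in the traces.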

The MAS operation is performed in two steps as addition-subtraction or 
subtraction-addition. Depending on the result of the first operation the second 
operation takes place or is ignored. This behavior can be observed when we zoom 
in on Step 3 as shown in Fig. 8. This figure shows that after 160 cycles the first 
subtraction ends and the next addition operation starts. The addition lasts 160 
clock cycles. 

The power consumption trace of one 160-bit EC point doubling is shown in 
Fig. 10. As expected, the number of clock cycles for EC point doubling is less 
than the number of clock cycles for EC point addition. The main difference in 
power consumption between EC point addition and EC point doubling can be 
observed by looking at Steps 7, 8, 10, 11, 12 and 14. In these steps only modular 
addition/subtraction takes place. Obviously the latency and power consumption 
of these are smaller than those of the others. This means that a simple 
power-analysis attack is easy to perform. 

Fig. 8. Power consumption trace of Step 3 of 160-bit EC point addition 




Fig. 9. Power consumption trace of 160-bit EC point addition from VCCINT 

Fig. 10. Power consumption trace of 160-bit EC point doubling from VCCINT 



The EC point multiplication is implemented by using a simple double-and- 
add algorithm. For EC point addition and EC point doubling the circuits de- 
scribed above are used. The power consumption trace of a 160-bit EC point 
multiplication is shown in Fig. 11. It can easily be seen from Fig. 11 that the 
key used during this measurement is 1001100. 
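
How such a key is read off a trace can be sketched as follows. We assume a left-to-right double-and-add in which the leading key bit only initializes the accumulator; the text does not state the scanning direction, so this, the letters D and A for the doubling and addition patterns, and the function names are assumptions made for illustration.

```python
def double_and_add_ops(key_bits):
    """Sequence of operations (D = doubling, A = addition) for a key.

    key_bits is a string such as "1001100" whose leading bit is 1; that
    bit initializes the accumulator and causes no operation.
    """
    ops = []
    for bit in key_bits[1:]:
        ops.append("D")          # every key bit costs one doubling
        if bit == "1":
            ops.append("A")      # a set bit costs an additional addition
    return "".join(ops)


def key_from_ops(ops):
    """Recover the key bits from an observed sequence of D and A patterns."""
    bits = ["1"]
    i = 0
    while i < len(ops):
        if ops[i] != "D":
            raise ValueError("every key bit must start with a doubling")
        if i + 1 < len(ops) and ops[i + 1] == "A":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return "".join(bits)
```

Since additions and doublings have visibly different lengths and power profiles, labeling the segments of the trace is all that is needed to apply `key_from_ops`.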

Fig. 11. Power consumption trace of a 160-bit EC point multiplication from VCCINT. 

5.4 Applications and Future Work 

Our board makes it possible to verify the effectiveness of many of the proposed 
countermeasures for various algorithms. In particular, we believe that 
countermeasures that are based on masking or blinding intermediate values (see for 
example [Koc96] and [Cor99] for approaches on asymmetric schemes) can be 
evaluated with our board. Also, countermeasures for elliptic curve cryptosystems 
which are based on clever implementations of the elliptic curve operations [TB03] 
can be checked. All software based countermeasures can be evaluated with our 
setup. We plan to validate some of the countermeasures, such as [TB03], and to 
apply EM attacks [GMO01] on the FPGA. 

6 Conclusion 

We introduced a new platform for evaluating power analysis. Our approach 
consists of an FPGA placed on a hand-made board, which makes it very 
easy to conduct power-analysis attacks. We characterized the power 
consumption of a XILINX Virtex 800 FPGA and conclude that it is similar to the power 
consumption of an ordinary ASIC in CMOS technology. Therefore, it is possible 
to draw conclusions about the vulnerability of a certain circuit by performing 
power-analysis attacks on an FPGA implementation. Since programming an 
FPGA is considerably cheaper than manufacturing an ASIC, assessing a device's 
vulnerability towards power-analysis attacks is much cheaper on our platform. 
Consequently, our approach describes the first cheap and efficient way to 
conduct power-analysis attacks on a real implementation (i.e., not on a software 
simulation) of a circuit at a very early stage of the design flow. 

A The Xilinx Virtex Architecture 

Virtex devices feature a flexible, regular architecture that comprises an array 
of configurable logic blocks (CLBs) surrounded by programmable input/output 
blocks (IOBs), all interconnected by a rich hierarchy of fast, versatile routing 
resources. Virtex FPGAs have a coarse-grained architecture, are SRAM-based, 
and are customized by loading configuration data into internal memory cells. 



Configurable Logic Block. The basic building block of the Virtex CLB is 
the logic cell (LC) [Xil01]. An LC includes a 4-input function generator, carry 
logic, and a storage element. The output from the function generator in each 
LC drives both the CLB output and the D input of the flip-flop. Each Virtex 
CLB contains four LCs, organized in two similar slices. Figure 12 shows a more 
detailed view of a single slice. In addition to the four basic LCs, the Virtex CLB 
contains logic that combines function generators to provide functions of five or 
six inputs. 

The Virtex function generators are implemented as 4-input look-up tables 
(LUTs). In addition to operating as a function generator, each LUT can provide 
a 16 × 1-bit synchronous RAM. The storage elements in the Virtex slice can be 
configured either as edge-triggered D-type flip-flops or as level-sensitive latches. 
The D inputs can be driven either by the function generators within the slice 
or directly from the slice inputs, bypassing the function generators. In addition 
to Clock and Clock Enable signals, each Slice has synchronous set and reset 
signals (SR and BY). All the control signals can be inverted independently and 
are shared by the two flip-flops within the slice. 



I/O Block. The Virtex I/O Block (IOB) features SelectIO inputs and outputs 
that support a wide variety of I/O signaling standards [Xil01]. The three IOB 
storage elements function either as edge-triggered D-type flip-flops or as 
level-sensitive latches. Optional pull-up and pull-down resistors and an optional 
weak-keeper circuit are attached to each pad. Prior to configuration, all pins not 
involved in configuration are forced into their high-impedance state. 

Fig. 12. Simplified diagram 



I/O Banking. Some of the possible I/O standards require VCCO and/or VREF 
voltages. These voltages are connected to the device pins that serve groups of 
IOBs, called banks. Consequently, not all I/O standards can be combined within 
a given bank. Each bank has multiple VCCO (Output supply voltage) pins, all 
of which must be connected to the same voltage. This voltage is determined by 
the output standards in use. 



Configuration of the FPGA. Virtex devices are configured by loading con- 
figuration data into the internal configuration memory. Some of the pins used 
for this are dedicated configuration pins, while others can be re-used as general 
purpose inputs and outputs once configuration is complete. Virtex supports four 
configuration modes which are the Slave-serial mode, the Master-serial mode, 
the SelectMAP mode and the Boundary-scan mode. The configuration mode 
pins (M2, M1, and M0) define which of these modes is used. Our board supports 
three of these configuration modes. 

Abstract. Bernstein [1] and Lenstra et al. [5] have proposed specialized 
hardware devices for speeding up the linear algebra step of the number 
field sieve. A key issue in the design of these devices is the question 
whether the required hardware fits onto a single wafer when dealing 
with cryptographically relevant parameters. 

We describe a modification of these devices which distributes the tech- 
nologically challenging single wafer design onto separate parts (chips) 
where the inter-chip wiring is comparatively simple. A preliminary 
analysis of a ‘distributed variant of the proposal in [5]’ suggests that 
the linear algebra step for 1024-bit numbers could be doable on a 
23 × 23 network of special-purpose processors in less than 19 hours at 
a clocking rate of 200 MHz, where each processor has about the size of 
a Pentium Northwood. Allowing for a 16 × 16 mesh of processing units 
of 36 mm × 36 mm each, the linear algebra step might take less than 3 
hours. 

Keywords: Factorization, number field sieve, linear algebra, RSA 



1 Introduction 

Nowadays, the most common algorithm for factoring large integers is the so- 
called number field sieve (NFS). The NFS involves two computationally par- 
ticularly expensive steps — the relation collection step and the task of solving 
a large sparse system of linear equations over GF(2) resp. of finding a linear 
dependence among binary vectors. In this contribution we deal only with the 
latter step. Based on the block Wiedemann algorithm [4,7], Bernstein [1] and 
Lenstra et al. [5] recently proposed specialized hardware devices for speeding up 
this part of the NFS. 

In the present form, a major problem of these proposals is the size of the 
circuits and thereby the question of scalability: for larger parameter values, the 
proposed circuits do not fit onto a single wafer of diameter 300 mm any more, 
and high-speed communication between wafers is quite difficult to realize. But 
having in mind imperfections in actual manufacturing processes, already a single 
wafer design as proposed in [5] is rather non-trivial to realize. To circumvent 
this problem, in this paper we propose a technique for distributing the algorithms 
in [1,5] onto several wafers in such a way that (at least for the case of 1024-bit 
numbers) the performance of the algorithms does not decrease and 
the inter-wafer communication can be kept rather simple. It is appropriate to 
mention here that the idea of distributing the linear algebra step onto several 
‘smaller computers’ is not new; e. g., in [2] ideas for implementing the linear 
algebra step ‘in parallel on a network of relatively small machines’ are described. 

In Section 2 we briefly recall the essential hardware requirements of the 
two specialized architectures due to Bernstein and Lenstra et al., and thereafter 
we describe a method for overcoming the hardware limits of these approaches 
to a certain extent. To get a better idea of the possible use of our approach, 
Sections 3.2 and 3.3 analyze a ‘multi-wafer’ variant of the proposals in [1,5] for 
512-bit and 1024-bit numbers in more detail. It turns out that even for 1024-bit 
numbers the linear algebra step seems to be doable within a few hours by means 
of a distributed hardware that can be manufactured with currently available 
technology. 

2 Two Architectures for the Linear Algebra Step 

Within the relation collection step of the NFS a (w.l.o.g. square) sparse matrix 
$A \in \mathrm{GF}(2)^{m \times m}$ is constructed. For 1024-bit numbers, the estimations in [5, 
Section 5.1] suggest values of $m \approx 4 \cdot 10^7$ or $m \approx 10^{10}$, where on average a column 
contains about 100 non-zero entries. For representing the matrix A throughout 
the computations, only the coordinates of its non-zero entries are stored. 

To find the linear dependency among the columns of A needed in the NFS, 
the proposals in [1,5] make use of the block Wiedemann algorithm. Basically, 
this algorithm reduces the problem of finding a linear dependency among the 
columns of A to the problem of computing efficiently (long) sequences of the 
form 

$A \cdot v, A^2 \cdot v, \ldots, A^k \cdot v$ 

where $v \in \mathrm{GF}(2)^m$ is a (not necessarily sparse) binary vector. A typical value 
is $k \approx 2m/K$ with a blocking factor $K = 1$ or $K > 32$ (for a blocking factor 
$K > 1$ several different values of the vector $v$ are handled simultaneously). 

Accordingly, for reducing the cost of the matrix step in the NFS, the devices 
proposed by Bernstein and Lenstra et al. aim at reducing the time required for 
computing such iterated (left-)multiplications with A. While the construction in 
[1] uses a parallel sorting algorithm for this purpose, the proposal in [5] relies on 
the use of a parallel routing algorithm. In the next two sections we shortly recall 
the respective hardware requirements of these devices; for an explanation of the 
algorithmic details we refer to the original papers. 
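
To make the task of these devices concrete, the iterated multiplication can be sketched in software. The Python code below stores only the coordinates of the non-zero entries of $A$, as described above, and computes the sequence $A \cdot v, A^2 \cdot v, \ldots, A^k \cdot v$ over GF(2). It is a naive sequential illustration, not a model of either hardware proposal; the column-wise storage format and the function names are assumptions.

```python
def mat_vec_gf2(columns, v):
    """Compute A * v over GF(2).

    columns[j] lists the row indices of the non-zero entries in column j
    of the square matrix A; v is a list of bits of the same dimension.
    """
    result = [0] * len(columns)
    for j, rows in enumerate(columns):
        if v[j]:
            for i in rows:
                result[i] ^= 1      # addition in GF(2) is XOR
    return result


def krylov_sequence(columns, v, k):
    """Return the list [A*v, A^2*v, ..., A^k*v]."""
    sequence = []
    current = list(v)
    for _ in range(k):
        current = mat_vec_gf2(columns, current)
        sequence.append(current)
    return sequence
```

For the matrix with rows (1 0 1), (0 1 1), (1 0 0), the column lists are `[[0, 2], [1], [0, 1]]`. Each call to `mat_vec_gf2` touches every non-zero entry once, and it is this inner step that the sorting and routing meshes parallelize.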



2.1 Bernstein’s Device for the Matrix Step 

Concerning hardware requirements, the essential algorithmic tool in the proposal 
of [1] is Schimmler’s sorting algorithm [1,6]: assume we are given a mesh of 




Hardware to Solve Sparse Systems of Linear Equations over GF(2) 



$M \times M$ processing units $(Q_{i,j})_{1 \le i,j \le M}$ where $M := 2^n$ and each processing unit 
$Q_{i,j}$ stores an integer value $q_{i,j}$. Then Schimmler’s sorting algorithm allows for 
sorting these numbers in $8M - 8$ ‘steps’ according to any of the following 
orders on the indices $(i,j)$ of the processing units $Q_{i,j}$: 

left-to-right: (1, 1) < (1, 2) < . . . < (1, M) < (2, 1) < . . . < (M, M) 
right-to-left: (1, M) < (1, M - 1) < . . . < (1, 1) < (2, M) < . . . < (M, 1) 
snakelike: (1, 1) < (1, 2) < . . . < (1, M) < (2, M) < (2, M - 1) < . . . < (M, 1) 

An ‘elementary step’ of the algorithm looks as follows: analogously to 
odd-even transposition sorting, in a single step each processing unit $Q_{i,j}$ 
communicates with exactly one of its horizontal or vertical neighbours. So let $Q$, $\tilde{Q}$ 
be two communicating processing units, and denote by $q$, $\tilde{q}$ the integers stored 
in $Q$, $\tilde{Q}$, respectively. At the end of one ‘elementary step’ one of the two 
processing units, say $Q$, must hold the value $\min(q, \tilde{q})$ while the other one has to 
store $\max(q, \tilde{q})$. For achieving this one can proceed as follows: 

1. $Q$ sends $q$ to $\tilde{Q}$, and $\tilde{Q}$ sends $\tilde{q}$ to $Q$. E.g., if the stored integers represent 
natural numbers $< 2^{26}$, this operation can be completed in one clock cycle 
via a unidirectional 26-bit bus in each direction. 

2. Both $Q$ and $\tilde{Q}$ compute the boolean value exchange := $(\tilde{q} < q)$. E.g., if $q$ 
and $\tilde{q}$ are 26-bit numbers, this comparison can be done in one clock cycle. 

3. If exchange evaluates to true, then $Q$ stores $\tilde{q}$ and deletes $q$. Analogously, $\tilde{Q}$ 
keeps $q$ and deletes $\tilde{q}$ in this case. If exchange evaluates to false, then both 
$Q$ and $\tilde{Q}$ keep their old values and delete the values received in the first 
step. Again, for 26-bit integers this operation does not require more than 
one clock cycle. In fact it is feasible to integrate this step into the previous 
one without requiring an additional clock cycle. 
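
The elementary step can be sketched in a few lines of Python. The sketch applies the compare-exchange operation in the one-dimensional odd-even transposition sort that the elementary step is modelled on; it is not Schimmler's two-dimensional mesh algorithm, and the function names are assumptions made for illustration.

```python
def compare_exchange(q, q_tilde):
    """One elementary step between two neighbouring units.

    Both units see both values; afterwards the first unit holds the
    minimum and the second unit holds the maximum.
    """
    exchange = q_tilde < q
    if exchange:
        return q_tilde, q
    return q, q_tilde


def odd_even_transposition_sort(values):
    """Sort a line of cells using only neighbour compare-exchange steps."""
    cells = list(values)
    n = len(cells)
    for step in range(n):               # n alternating phases suffice
        start = step % 2                # even phase, then odd phase, ...
        for i in range(start, n - 1, 2):
            cells[i], cells[i + 1] = compare_exchange(cells[i], cells[i + 1])
    return cells
```

All pairs handled within one phase are disjoint, so in hardware they are processed simultaneously; the inner loop only serializes what the mesh does in parallel.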

In summary, when dealing with natural numbers $< 2^{26}$, Schimmler’s sorting 
algorithm enables us to sort $M^2$ numbers in less than $8M$ steps where each 
step takes 2 clock cycles. Assuming that each column of $A$ contains $d$ non-zero 
entries, one matrix-vector multiplication requires $m \cdot d$ processing units to store 
the matrix $A$ and $m$ processing units to store the entries of the vector $v$. Using 
Bernstein’s approach, a matrix-vector multiplication can thus be realized on a 
mesh of size $M \times M$, provided that $M^2 \geq d \cdot m + m$. For one multiplication three 
sorting steps with $\approx 8 \cdot M$ exchange operations, requiring 2 clock cycles each, 
are necessary. Consequently, one matrix-vector multiplication can be performed 
in approximately 3 • 2 • 8 • M = 48 • M clock cycles. 
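
As a worked example of these estimates, the following Python sketch computes the mesh size and the cycle count per matrix-vector multiplication. It assumes the ‘small matrix’ has about $4 \cdot 10^7$ columns with $d = 100$ non-zero entries each, as quoted in Section 2, and it rounds the mesh side up to a power of two; the function name is an assumption, and the result is only as good as the approximation of 48 · M clock cycles.

```python
import math


def bernstein_mesh(m, d):
    """Return (minimal side, side rounded to a power of two, clock cycles).

    The mesh must hold m * d matrix entries plus m vector entries, and one
    matrix-vector multiplication takes about 48 * M clock cycles.
    """
    cells = m * d + m
    side = math.isqrt(cells)
    if side * side < cells:
        side += 1                               # ceiling of the square root
    side_pow2 = 1 << (side - 1).bit_length()    # next power of two >= side
    return side, side_pow2, 48 * side_pow2


side, side_pow2, cycles = bernstein_mesh(4 * 10**7, 100)
```

Under these assumptions the mesh side is 63561, rounded up to 65536, and one multiplication costs about 3.1 million clock cycles.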

For the factorization of 1024-bit numbers (using the ‘small matrix’ with 
$m \approx 4 \cdot 10^7$), in [5] the average number of transistors per processing unit is estimated 
to be around 2000. Assuming that a standard 0.13 µm manufacturing process is 
used, with [5, Table 2] we thus conclude that one processing unit requires an area 
of ≈ 4760 µm² resp. a square of about 0.07 × 0.07 mm². Analogously, assuming 
a processing unit for the case of 512-bit numbers to require 1800 transistors, 
we obtain an estimated area of ≈ 4280 µm² per processing unit in this case. 
Here, the estimation for the number of transistors is based on a matrix of size 
$6.7 \cdot 10^6 \times 6.7 \cdot 10^6$ where each column contains 63 non-zero entries (cf. [3]); for 
this matrix size 23 bits are sufficient to represent a column or row index. 

2.2 Lenstra et al.’s Device for the Matrix Step 

Similarly to Bernstein’s proposal, the architecture put forward by Lenstra et al. 
is based on a mesh of simple processing units. But as opposed to [1], the mesh 
is used for routing rather than sorting. Concerning the factorization of 1024- 
bit numbers, Lenstra et al. discuss two possible matrix sizes ($m \approx 4 \cdot 10^7$ resp. 
$m \approx 10^{10}$). For describing the device that is to fit onto a single wafer of diameter 
300 mm, the ‘small matrix’ is used, and we restrict our discussion to this case. 

Depending on the precise choice of parameters, the mesh in [5] uses one or two 
types of processing units. For the single wafer device just mentioned, only one 
type is used, and each node is a so-called target node. Basically, this means that 
each node stores all row coordinates of $p > 1$ non-zero entries of $A$ as well as 
$p$ entries of $v$. After having performed a complete matrix-vector multiplication 
$A \cdot v$, the entries storing $v$ are replaced by the entries of the vector $A \cdot v$. Denoting 
again by $d$ the number of non-zero entries per column, the non-zero entries of 
$A$ can be distributed onto $m/p$ processors where each processor has sufficient 
DRAM for storing $p \cdot d$ matrix entries. 

The main tool utilized for the actual computation of a matrix-vector multiplication
is so-called clockwise transposition routing, which relies on the iterated
application of exchange operations executed in parallel. Routing a single
value takes about $2 \cdot \sqrt{m/p}$ clock cycles, and the processing of the individual
matrix entries and matrix columns can overlap. As a worst-case bound we
can assume a complete matrix-vector multiplication to require no more than
$p \cdot d \cdot 2 \cdot \sqrt{m/p} = 2 \cdot d \cdot \sqrt{m \cdot p}$ clock cycles.
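As a quick numerical check of this bound, the following Python sketch evaluates both forms of the expression for the ‘small matrix’ parameters used in this section ($m \approx 4 \cdot 10^7$, $d = 100$, $p = 42$). The function name is purely illustrative and does not come from [5].

```python
import math

def routing_cycles_bound(m, d, p):
    """Worst-case clock cycles for one matrix-vector multiplication:
    p*d matrix entries per node, each routed in about 2*sqrt(m/p) cycles."""
    cycles_per_value = 2 * math.sqrt(m / p)
    return p * d * cycles_per_value

m, d, p = 4e7, 100, 42
bound = routing_cycles_bound(m, d, p)
closed_form = 2 * d * math.sqrt(m * p)   # simplified form of the same bound

print(f"bound       = {bound:.3e} clock cycles")
print(f"closed form = {closed_form:.3e} clock cycles")
```

Both expressions evaluate to roughly $8.2 \cdot 10^6$ clock cycles for these parameters.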

So far, our discussion has ignored the blocking factor $K$; as pointed out in [5],
for a given blocking factor $K$, Wiedemann’s algorithm requires the computation
of $K$ multiplication chains

\[ A \cdot v_i,\; A^2 \cdot v_i,\; \ldots,\; A^k \cdot v_i \]

with different vectors $v_i$ ($1 \le i \le K$). Using slightly more complicated hardware
(see [5] for details), these $K$ chains can be computed in parallel with the
same routing circuit. Basically, for blocking factor $K$ each processing unit needs
$2 \cdot p \cdot K$ bits of memory to store the vectors $v_i$. In particular, the value of $K$
is relevant when estimating the space requirement for the target units; here we
assume $K = 208$ (the value chosen in [5] for the single wafer device for 1024-bit
numbers).

With $m \approx 4 \cdot 10^7$ the average number of transistors necessary for one target
unit (excluding DRAM) can be estimated to be around $2040 + 60 \cdot K$ (cf. Section 3.2).
The area needed for a single DRAM bit is ≈ 0.7 µm² resp. 0.2 µm²
with a specialized DRAM process. Thus, for the ‘small matrix case’ of 1024-bit
numbers, the DRAM of a complete target cell occupies about 88700 µm² resp.
25300 µm² with a specialized DRAM process. The space requirement for the
$2040 + 60 \cdot 208$ transistors computes to 34600 µm² and 40700 µm², respectively.
In total, for one target cell 123300 µm² with a standard process resp. 66000 µm²
with a specialized DRAM process are needed in the 1024-bit case. When dealing
with 512-bit numbers (and a matrix of size $6.7 \cdot 10^6 \times 6.7 \cdot 10^6$ with 63 non-zero
entries per column), the DRAM per target cell occupies only 54800 µm² resp.
15700 µm² with a specialized DRAM process. Adding the space for $1920 + 60 \cdot 208$
transistors, we obtain a total space requirement of 89100 µm² resp. 56100 µm²
per target cell.
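The area figures above can be approximately reproduced with a short Python sketch. It assumes that each of the $p \cdot d$ matrix entries is stored as a $\lceil \log_2 m \rceil$-bit row index and that $2 \cdot p \cdot K$ further DRAM bits hold the vectors. The per-transistor areas of 2.38 µm² (standard process) and 2.8 µm² (specialized DRAM process) are not stated in the text; they are back-computed here from the quoted totals and should be read as assumptions, as should the function name.

```python
import math

def target_cell_area(m, d, p, K, base_transistors,
                     dram_um2_per_bit, um2_per_transistor):
    """Return (DRAM area, logic area) of one target cell in square micrometres."""
    index_bits = math.ceil(math.log2(m))        # bits per stored row index
    dram_bits = p * (d * index_bits + 2 * K)    # matrix entries plus vector bits
    dram_area = dram_bits * dram_um2_per_bit
    logic_area = (base_transistors + 60 * K) * um2_per_transistor
    return dram_area, logic_area

# 1024-bit 'small matrix' case, standard process (assumed 2.38 um^2 per transistor)
dram_std, logic_std = target_cell_area(4e7, 100, 42, 208, 2040, 0.7, 2.38)
# 1024-bit case, specialized DRAM process (assumed 2.8 um^2 per transistor)
dram_spec, logic_spec = target_cell_area(4e7, 100, 42, 208, 2040, 0.2, 2.8)
# 512-bit case, standard process
dram_512, logic_512 = target_cell_area(6.7e6, 63, 42, 208, 1920, 0.7, 2.38)

print(round(dram_std), round(logic_std), round(dram_std + logic_std))
print(round(dram_spec), round(logic_spec), round(dram_spec + logic_spec))
print(round(dram_512), round(logic_512), round(dram_512 + logic_512))
```

Under these assumptions the results agree with the figures in the text up to rounding.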

2.3 Estimated Mesh Size and Performance for 512 Bit and 1024 Bit 

With a standard 0.13 µm process, about $2.915 \cdot 10^{10}$ transistors fit on a single
wafer of diameter 300 mm (cf. [5]). Using the above figures, a straightforward
computation now yields the following estimate for the wafer area and time
needed by Bernstein’s approach, when dealing with $\log_2(n)$-bit numbers:



$\log_2(n)$   m                  d     # proc              M          area in wafers   clock cycles       LA
512           $6.7 \cdot 10^6$   63    $4.3 \cdot 10^8$    $2^{15}$   74 (26.6)        $1.6 \cdot 10^6$   18 h
1024          $4 \cdot 10^7$     100   $4.04 \cdot 10^9$   $2^{16}$   295 (277.2)      $3.1 \cdot 10^6$   207 h



The area in brackets indicates the area required to store the matrix and the
vector; the difference to the real area required comes from choosing $M$ as a
power of 2 with $M^2 \ge m \cdot (d + 1)$. The last column (labeled LA) is the
estimated total time of the linear algebra step; more precisely, the value given is
the time for performing $3 \cdot m$ matrix-vector multiplications (see [5]) at a clocking
rate of 500 MHz.
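The table entries can be re-derived with a few lines of Python. This is only a sketch: the figure of $48 \cdot M$ clock cycles per matrix-vector multiplication on an $M \times M$ mesh is the one quoted for Bernstein’s approach in Section 3.2, and the function name is illustrative.

```python
def bernstein_estimate(m, d, clock_hz=500e6, cycles_per_mesh_side=48):
    """Return (M, clock cycles per multiplication, LA time in hours)."""
    M = 1
    while M * M < m * (d + 1):      # smallest power of 2 with M^2 >= m*(d+1)
        M *= 2
    cycles = cycles_per_mesh_side * M
    hours = 3 * m * cycles / clock_hz / 3600   # 3*m multiplications in total
    return M, cycles, hours

M_512, cycles_512, hours_512 = bernstein_estimate(6_700_000, 63)
M_1024, cycles_1024, hours_1024 = bernstein_estimate(40_000_000, 100)

print(M_512, cycles_512, round(hours_512, 1))
print(M_1024, cycles_1024, round(hours_1024, 1))
```

The sketch yields $M = 2^{15}$ and $M = 2^{16}$ with roughly 17.6 h and 210 h; the small deviation from the 207 h in the table presumably stems from rounding the cycle count to $3.1 \cdot 10^6$ before multiplying.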

A 512-bit device with these parameters has a size of 2.14 m × 2.14 m; for the
1024-bit case we obtain a (wafer) area of 4.5 m × 4.5 m. Thus, realizing such a
device seems quite hypothetical.

For the device described by Lenstra et al. [5] we assume $p = 42$, i.e., each
target unit takes care of 42 matrix columns (cf. [5, Table 3]). For 512-bit numbers,
$402^2$ target units including DRAM then fit on an area of 95 mm × 95 mm; and
for 1024-bit numbers $1026^2$ target units including DRAM fit on the area of a
single wafer. This results in the following estimated space and time requirements
for performing the linear algebra step:



$\log_2(n)$   m                  d     # targets          M      area (wafers)   clk cycles         LA
512           $6.7 \cdot 10^6$   63    $1.6 \cdot 10^5$   400    0.13            $2.1 \cdot 10^6$   17 min
1024          $4 \cdot 10^7$     100   $9.5 \cdot 10^5$   1024   1               $8.2 \cdot 10^6$   6.5 h



Here the estimated total time of the linear algebra step is the time for performing
$3 \cdot m/K = 3 \cdot m/208$ matrix-vector multiplications at a clocking rate of
200 MHz (cf. [5]).
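As a plausibility check, the LA times in the table follow from the number of multiplications, the cycle count per multiplication, and the clock rate. The short Python sketch below takes the cycle counts directly from the table above; the function name is illustrative.

```python
def la_time_seconds(m, K, cycles_per_multiplication, clock_hz=200e6):
    """Time for 3*m/K matrix-vector multiplications at the given clock rate."""
    multiplications = 3 * m / K
    return multiplications * cycles_per_multiplication / clock_hz

minutes_512 = la_time_seconds(6.7e6, 208, 2.1e6) / 60
hours_1024 = la_time_seconds(4e7, 208, 8.2e6) / 3600

print(f"512 bit:  {minutes_512:.1f} min")
print(f"1024 bit: {hours_1024:.2f} h")
```

This gives about 17 minutes and about 6.6 hours, in line with the table.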

3 Distributing the Computation 

For several of the sizes mentioned above and for several other parameter choices
described in [5], the specialized hardware for the linear algebra step no
longer fits on a single wafer. But due to the practical limitations of manufacturing
processes, realizing even the single wafer devices is quite challenging. In
this section we discuss an approach for circumventing the problem of
handling sophisticated, highly parallel I/O hardware for fast inter-wafer communication;
at least for 1024-bit numbers this approach seems to improve the
situation significantly.



3.1 Using Block Matrix Multiplication 

For our discussion we adopt the assumption from [5] that the non-zero entries
in the matrix $A \in \mathrm{GF}(2)^{m \times m}$ are uniformly distributed. It should be emphasized
that the original matrix $A$ cannot be expected to have such a uniform
distribution, and here we do not discuss what a preprocessing step for
achieving this could look like. The ‘rectangular matrix blocks’ we will use should
allow for some leeway here, and subsequently we make the assumption that a
suitable preprocessing has been done already, e.g., by having applied suitable
row and column permutations to the original $A$.

We start by splitting the matrix $A$ into $s \cdot s$ submatrices $A_{i,j}$, $1 \le i, j \le s$,
of approximately the same size $m/s \times m/s$. It is not mandatory that all $A_{i,j}$
are square matrices, but we insist that for fixed $i_0 \in \{1, \ldots, s\}$ all matrices $A_{i_0,j}$
($1 \le j \le s$) have the same number of rows:

\[ A = \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,s} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,s} \\ \vdots & \vdots & \ddots & \vdots \\ A_{s,1} & A_{s,2} & \cdots & A_{s,s} \end{pmatrix} \]



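The size of the hardware devices in [1,5] depends on the number of non-zero
entries in the processed matrix, and the aim of the separation just mentioned is
to split the matrix into submatrices with approximately the same number of
non-zero elements. After splitting the vector $v$ into appropriately sized parts

\[ v = \begin{pmatrix} v_{1,1} \\ \vdots \\ v_{1,s} \end{pmatrix} = \cdots = \begin{pmatrix} v_{s,1} \\ \vdots \\ v_{s,s} \end{pmatrix} \]

(where the number of rows of $v_{i,j}$ is equal to the number of columns of $A_{i,j}$),
the multiplication $A \cdot v$ can be realized as

\[ A \cdot v = \begin{pmatrix} \sum_{j=1}^{s} A_{1,j} \cdot v_{1,j} \\ \vdots \\ \sum_{j=1}^{s} A_{s,j} \cdot v_{s,j} \end{pmatrix}. \]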
This can be performed with $s^2$ multiplication circuits (preloaded with the matrices
$A_{i,j}$) in the following way:

1. Load $v_{i,j}$ into the circuit corresponding to $A_{i,j}$ through a pipeline of length
$s$ and a bus of width $b$ (if $A_{i,j}$ is not a square matrix, we can think of the
missing column/row entries as being 0).

2. Perform the matrix-vector multiplications $A_{i,j} \cdot v_{i,j}$ in parallel.

3. Output the resulting vectors $A_{i,j} \cdot v_{i,j}$ through a bus of width $b$.

4. Perform the summations $w_i := \sum_{j=1}^{s} A_{i,j} \cdot v_{i,j}$ ($i = 1, \ldots, s$) with $s$
XOR-pipelines of length $s$. Each of these XOR-circuits is adjacent to one
multiplication circuit and adds the output of this circuit to the output of the
previous stage of the pipeline. Each of these XOR-circuits has two inputs
and one output of width $b$ and works during the output of the multiplication
circuits in a pipeline architecture.

These XOR-circuits are extremely simple, but have to be built up out of 
several chips due to the width of the bus. A different approach is to include 
these XORs into the adjacent multiplication circuit; then the saved hardware 
has to be paid for by a doubling of the I/O time. This part of the hardware 
should not cause a major problem and is neglected here. 

5. Analogously to the vector $v$ before, the vector $w := (w_1, \ldots, w_s)$ is now
split into parts $w_{i,j}$. These $w_{i,j}$ are then ready to be loaded into the
multiplication circuits to perform the multiplication $A \cdot w$ if required.

At this stage we can also easily perform a vector-vector multiplication of the
form $u \cdot A v = u \cdot w$ as needed in the block Wiedemann algorithm. The vectors
$u$ are usually chosen to be of very low Hamming weight, and we thus ignore
the (marginal) computational effort of these multiplications.

The loading of $A \cdot v$ is performed through a pipeline structure, similar to
the XOR-pipeline for the outputs. If the XOR-circuits are extended with an
additional register and a multiplexer (to switch between the ‘horizontal’ and
‘vertical’ bus), the same chips can be used for both pipelines.
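The block decomposition described in steps 1 to 5 can be illustrated in software. The following Python sketch is a functional model only: it ignores buses, pipelines and timing, uses dense 0/1 lists instead of the sparse representation of the actual devices, and all function names are illustrative. It checks that XOR-ing the partial products of the $s^2$ blocks gives the same result as a direct multiplication over GF(2).

```python
import random

def matvec_gf2(A, v):
    """Multiply a 0/1 matrix A (list of rows) by a 0/1 vector v over GF(2)."""
    return [sum(a & x for a, x in zip(row, v)) & 1 for row in A]

def block_matvec_gf2(A, v, s):
    """Compute A*v over GF(2) from s*s blocks, XOR-ing the partial results
    of each block row (the role of the XOR-pipelines in step 4)."""
    m = len(A)
    bounds = [m * k // s for k in range(s + 1)]      # block boundaries
    w = []
    for i in range(s):
        rows = A[bounds[i]:bounds[i + 1]]
        acc = [0] * len(rows)
        for j in range(s):
            lo, hi = bounds[j], bounds[j + 1]
            block = [row[lo:hi] for row in rows]     # submatrix A_{i,j}
            part = matvec_gf2(block, v[lo:hi])       # step 2: A_{i,j} * v_{i,j}
            acc = [a ^ b for a, b in zip(acc, part)] # step 4: XOR accumulation
        w.extend(acc)                                # w_i for block row i
    return w

random.seed(1)
m = 10
A = [[random.randint(0, 1) for _ in range(m)] for _ in range(m)]
v = [random.randint(0, 1) for _ in range(m)]
print(block_matvec_gf2(A, v, 3) == matvec_gf2(A, v))
```

Because $m = 10$ is not divisible by $s = 3$, the blocks in this example are not all square, which mirrors the remark that the $A_{i,j}$ need not be square as long as all blocks in one block row have the same number of rows.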



3.2 Performance of the Distributed Device 

Let us now look at the space and time requirements of the distributed architecture
just described; for the sake of simplicity, we assume all submatrices $A_{i,j}$ to be
$(m/s) \times (m/s)$ square matrices. We also consider the choice of a blocking factor
$K \ge 1$; for $K > 1$ several vectors are handled in parallel, and of course we have
to take into account the additional bandwidth required here.

Step 1. Loading the $K$ vectors $v_{i,j}$ into the multiplication circuits requires approximately
$4 \cdot m \cdot K/(s \cdot b) + 4 \cdot s$ clock cycles, where the time for loading
one bit is estimated to be 4 clock cycles, and $4 \cdot s$ clock cycles are needed to
empty the pipeline. All but the last $b$ input bits can be distributed to the
processing units (or target cells) while the following input bits arrive. The
extra time required to distribute the last $b$ bits is neglected here.

Step 2. Each of the multiplication circuits has to perform a multiplication of a
matrix with about $m \cdot d/s^2$ entries with a binary vector of size approximately
$m/s$. With Bernstein’s approach, this can be done in $48 \cdot M$ clock cycles on
an $M \times M$ mesh where $M \ge \sqrt{m \cdot d/s^2 + m/s}$ is a power of 2.
With the design from [5], this matrix-vector multiplication requires $M^2$ target
cells for $p$ columns each, where $M^2 \ge m/(s \cdot p)$. Here, each target cell
is equipped with DRAM for $p \cdot d/s$ matrix entries, and we can estimate the
number of clock cycles required for the matrix-vector multiplication to be
no larger than $2 \cdot p \cdot d \cdot M/s$.

Step 3. Transferring the vector $w_i$ to the output buffer requires one sorting step
in Bernstein’s architecture ($\approx 2 \cdot 8 \cdot M$ clock cycles). In the architecture
of Lenstra et al. the computational effort of this step is negligible (due to
the known addresses of the bits of $w_i$, and thus the possibility to use a
pipeline procedure during the output); we estimate the output to require
$\approx 4 \cdot m \cdot K/(s \cdot b)$ clock cycles.

Step 4. The summation of the subvectors is performed with a pipeline structure
while the outputs of the multiplication units arrive. An additional $\approx 4 \cdot s$ clock
cycles are needed to empty the pipeline.

Summarizing our discussion, we have the following characterizing figures: 

— With Bernstein’s architecture, $M^2$ processing units are required for each of the $s^2$ parts,
where $M$ is a power of 2 and $M^2 \ge m \cdot d/s^2 + m/s$. Taking
into account the registers ($2 \cdot 8 \cdot \lceil \log_2(m/s) \rceil$ transistors), multiplexers
($3 \cdot 4 \cdot \lceil \log_2(m/s) \rceil$ transistors), a subtraction unit ($5 \cdot 8 \cdot \lceil \log_2(m/s) \rceil$ transistors), and control
logic (300 transistors) needed, we estimate that one processing unit consists
of $\approx 68 \cdot \lceil \log_2(m/s) \rceil + 300$ transistors.¹

The number of clock cycles for a complete matrix-vector multiplication is
approximately $8 \cdot \lceil m/(s \cdot b) \rceil + 48 \cdot M + 8 \cdot s$.

— With the architecture of Lenstra et al., $M^2 \ge m/(s \cdot p)$ processing units resp.
target cells are used on each of the $s^2$ parts, if one target cell takes care
of $p$ matrix columns. Taking into account the required DRAM for representing
the non-zero matrix entries ($p \cdot d \cdot \lceil \log_2(m/s) \rceil / s$ bit), the DRAM
bits for storing the $K$ processed vectors ($2 \cdot p \cdot K$ bit) along with an access
logic ($40 \cdot K$ transistors), a register for a received ‘package’ from the mesh
($8 \cdot (\lceil \log_2(m/s) \rceil + K)$ bit), three multiplexers ($3 \cdot 4 \cdot (\lceil \log_2(m/s) \rceil + K)$
transistors), a subtraction unit ($(8 \cdot 5 \cdot \lceil \log_2(m/s) \rceil)/2$ transistors), and additional
logic (1000 transistors), we estimate one processing unit to require
$\approx 40 \cdot \lceil \log_2(m/s) \rceil + 60 \cdot K + 1000$ transistors and $p \cdot (d \cdot \lceil \log_2(m/s) \rceil / s + 2 \cdot K)$
bit of DRAM. The number of clock cycles for a complete matrix-vector multiplication
is $\approx 8 \cdot \lceil m \cdot K/(s \cdot b) \rceil + 2 \cdot p \cdot d \cdot M/s + 8 \cdot s$.
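To see how these cycle counts translate into running times, the following Python sketch evaluates both formulas and splits each result into an I/O part (bus transfer and pipeline) and a mesh part. The two parameter sets are taken from one row each of Tables 1 and 2 in Section 3.3; the number of multiplications ($3 \cdot m$ resp. $3 \cdot m/K$) and the clock rates (500 MHz resp. 200 MHz) are those used throughout the paper. Function names are illustrative.

```python
import math

def bernstein_cycles(m, s, b, M):
    """(I/O cycles, mesh cycles) per multiplication, distributed Bernstein design."""
    io = 8 * math.ceil(m / (s * b)) + 8 * s
    mesh = 48 * M
    return io, mesh

def lenstra_cycles(m, d, s, b, K, p, M):
    """(I/O cycles, mesh cycles) per multiplication, distributed design from [5]."""
    io = 8 * math.ceil(m * K / (s * b)) + 8 * s
    mesh = 2 * p * d * M / s
    return io, mesh

def la_hours(multiplications, cycles, clock_hz):
    return multiplications * cycles / clock_hz / 3600

# 512-bit numbers, Bernstein-type chips: b = 2048, M = 2048, s = 11
m = 6_700_000
io_b, mesh_b = bernstein_cycles(m, s=11, b=2048, M=2048)
hours_b = la_hours(3 * m, io_b + mesh_b, 500e6)

# 1024-bit numbers, Lenstra-type chips: b = 128, K = 70, p = 669, M = 51, s = 23
m = 40_000_000
io_l, mesh_l = lenstra_cycles(m, d=100, s=23, b=128, K=70, p=669, M=51)
hours_l = la_hours(3 * m / 70, io_l + mesh_l, 200e6)

print(f"Bernstein: {hours_b:.2f} h, I/O share {io_b / (io_b + mesh_b):.1%}")
print(f"Lenstra:   {hours_l:.2f} h, I/O share {io_l / (io_l + mesh_l):.1%}")
```

For these parameters the sketch gives about 1.1 h with a small I/O share for the sorting-based design and about 18.8 h dominated by I/O for the routing-based design.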

In the next section we examine in more detail the performance of this distributed 
approach when dealing with 512-bit and 1024-bit numbers. For doing so, we 
consider various choices of the bus width b, the blocking factor K, the number 
of columns p handled per target cell, and the ‘degree of parallelism’ s. 

¹ Note that for storing a row or column index of a submatrix $A_{i,j} \in \mathrm{GF}(2)^{m/s \times m/s}$
only $\lceil \log_2(m/s) \rceil$ bits are needed.

3.3 Application to 512-Bit and 1024-Bit Numbers

Having in mind a practical manufacturing process, it is desirable that the indi- 
vidual parts of the distributed circuit are significantly smaller than (the inner 
square of) a complete 300 mm wafer. For the distributed circuit derived from the 
architecture in [5], we choose the size of the individual parts to be comparable 
to the size of an ‘ordinary’ Pentium Northwood processor. To avoid problems 
with the available number of pins connected to the bus b, we use somewhat 
conservative estimations for the bus width. 

For Bernstein’s circuit such small processing units are not really sensible, and 
we choose the individual parts to be larger. For these larger parts (or in other 
words chips), it is sensible to allow for a larger bus width b, as more pins can be 
located on the chip here. 

The bus width $b$ also limits the possible choices of the blocking factor $K$: before
performing the (next) matrix-vector multiplication by means of a (routing or
sorting) mesh, we have to load the respective parts of the vector to be processed
next into the processing units via the bus. However, with a simple trick we can
gain some parallelism ‘almost for free’: assume that each part of the distributed
device (in other words, each chip) handles $K$ vectors in parallel (for Bernstein’s
approach we have $K = 1$). Then while these $K$ vectors are processed, we can
load another $K$-tuple of vectors into a separate buffer on that chip. So once the
result of the previous multiplication is output, we can immediately load the new
vectors into the mesh. If the I/O time is about the same as the computation
time, then by interleaving the processing of two ‘tuples of vectors’ in this way,
we can in the ideal case almost halve the time needed for loading vectors onto
and from the chips (of course, the cells to store these additional vectors require
additional space on the chip, which then has to be taken into account).

Tables 1 and 2 show the performance of the distributed device for various
parameter choices; here, a potential optimization by ‘interleaving tuples of
vectors’ is not taken into account. As in Section 2.3, for estimating the total
time of the linear algebra step, the number of multiplications is assumed to be
$3 \cdot m$ in the design derived from Bernstein’s proposal, and $3 \cdot m/K$ in the design
derived from the proposal of Lenstra et al.



Table 1. Time for the LA step with a ‘distributed Bernstein design’ at a clocking rate
of 500 MHz.

$\log_2(n)$   b               #proc/chip   $s^2$    chip size          LA time
512           (single unit)   $32768^2$    1        2.14 m × 2.14 m    17.6 h
512           2048            $2048^2$     $11^2$   144 mm × 144 mm    1.1 h
512           1024            $1024^2$     $24^2$   72 mm × 72 mm      0.6 h
512           1024            $512^2$      $55^2$   36 mm × 36 mm      0.3 h
1024          (single unit)   $65536^2$    1        4.5 m × 4.5 m      210 h
1024          2048            $2048^2$     $37^2$   144 mm × 144 mm    6.8 h
1024          1024            $1024^2$     $84^2$   72 mm × 72 mm      3.4 h

Table 2. Time for the LA step with a ‘distributed Lenstra et al. design’ at a clocking
rate of 200 MHz.

$\log_2(n)$   b               K     p      #proc/chip   $s^2$    chip size            LA time
512           128             65    116    $76^2$       $10^2$   11.4 mm × 11.4 mm    73 min
512           128             53    51     $91^2$       $16^2$   11.4 mm × 11.4 mm    45 min
512           512             63    29     $152^2$      $10^2$   20 mm × 20 mm        19 min
512           1280            49    8      $290^2$      $10^2$   34 mm × 34 mm        8 min
512           1024            40    6      $265^2$      $16^2$   29 mm × 29 mm        6 min
1024          (single unit)   208   42     $975^2$      1        265 mm × 265 mm      6.1 h
1024          (single unit)   42    216    $43?^2$      1        162 mm × 162 mm      94.7 h
1024          128             30    1086   $48^2$       $16^2$   11.4 mm × 11.4 mm    29.7 h
1024          128             70    669    $51^2$       $23^2$   11.4 mm × 11.4 mm    18.8 h
1024          512             100   250    $100^2$      $16^2$   20 mm × 20 mm        7.0 h
1024          1024            160   278    $120^2$      $10^2$   30 mm × 30 mm        5.9 h
1024          1280            135   66     $195^2$      $16^2$   36 mm × 36 mm        2.8 h



For Bernstein’s approach we see that the distributed variant looks
much more practical than the original design. It is also worth noting that the
communication cost (i.e., the time for loading vectors onto/from the chips) is
less than 5% of the overall computation time, and thus is not really relevant. For
the design of Lenstra et al. the situation is quite the opposite: more than 90%
of the time is spent on I/O operations. However, the obtained circuitry is
much smaller and thus more realistic than the sorting based approach. In particular,
with a mesh of $23^2 = 529$ Pentium Northwood sized processing units, the
linear algebra step for a 1024-bit number should be doable in less than 19 hours.
Note here that the overall wafer area of this distributed device is the same as
for the original single wafer design of Lenstra et al. Concerning speed, the distribution
has to be paid for with a slow-down of more than a factor of 3. However,
manufacturing the small processing units is significantly simpler. Furthermore,
already with slightly larger processing units, which allow for a broader bus,
the overall computation time can be reduced to less than 3 hours. As most of the
time is spent on I/O, the bus width should be chosen as large as possible;
in our estimates we tried to be conservative here.

4 Conclusion 

The above discussion suggests that for 1024-bit numbers, a ‘distributed variant of
the design of Lenstra et al.’ could be realizable by means of current technology.
Besides circumventing a technologically challenging wafer-sized circuit, a
speed-up also seems to be possible if one allows for processing units of up to, say,
36 mm × 36 mm. But already with a mesh of $23^2$ processing units, where each
processing unit has approximately the size of a Pentium Northwood, the linear
algebra step for 1024-bit numbers seems to be doable in less than a day.

Abstract. This paper presents the results of applying an attack against
the Data Encryption Standard (DES) implemented in some applications, 
using side-channel information based on CPU delay as proposed in [11]. 
This cryptanalysis technique uses side-channel information on encryption 
processing to select and collect effective plaintexts for cryptanalysis, and 
infers the information on the expanded key from the collected plaintexts. 
On applying this attack, we found that the cipher can be broken with
$2^{23}$ known plaintexts and $2^{24}$ calculations at a success rate > 90%, using
a personal computer with 600-MHz Pentium III.

We also discuss the feasibility of cache attacks on ciphers that need many
S-box look-ups by reviewing the results of our experimental
attacks on block ciphers other than DES, such as AES.

Keywords: DES, AES, Camellia, cache, side-channel, timing attacks 



1 Introduction 

Recently, many cryptanalysis techniques have been proposed that measure
physical information from a cryptographic device. These techniques are
called “side-channel attacks.” Typical examples are Differential Power Analysis
[5], which measures the variation in power consumption of a cryptographic
device, and Differential Fault Analysis [1], which causes some sort of
physically erroneous operation to occur in a cryptographic device and then measures
the resulting phenomena. Because techniques of this kind are mainly used for
attacking cryptographic systems implemented on smart cards, anti-tampering
measures, e.g., adding noise to the consumed power, have been considered. “Timing
attacks” [2] [6], which measure the encryption time of a cryptographic application,
can also be treated as side-channel attacks. A countermeasure to attacks of this
type is to eliminate branch processing in the implemented algorithm so that
encryption times are equivalent.

Previously proposed timing attacks make use of the fact that conditional
branches that occur during encryption processing cause variations in encryption
time. CPU cache misses, however, can also cause such variations. Most
recent computers employ a “CPU cache”, abbreviated simply to
“cache” from here on, between the CPU and main memory, since this type of
hierarchical structure can speed up program run-time on average. If, however,
the CPU accesses data that are not stored in the cache, i.e., if a cache miss
occurs, a delay will be generated, as the target data must be loaded from main
memory into the cache. The measurement of this delay may enable attackers to
determine the occurrence and frequency of cache misses.

With the above in mind, we have focused our attention on data-access processing,
i.e., the operations of the S-box commonly used by encryption algorithms,
and have developed a new attack technique to infer information on the S-box
input from the variations in encryption time for different plaintexts. This is
classified as a side-channel attack on software-implemented ciphers, and it has
already broken MISTY1 [11] successfully. It does not require specialized measuring
equipment; the cipher can be broken in a relatively short time using a
personal computer, if the encryption module of the cipher is available. Though
Kelsey et al. described the feasibility of a cache-based attack on ciphers using
a large S-box, e.g., Blowfish [4], they did not describe a specific method. The first
application of an attack using a cache is described in [11].

We made experimental attacks on some block ciphers including the Data Encryption
Standard (DES). This paper describes the cases in which we could break the cipher
in spite of frequent S-box look-ups, as well as the cases in which the cipher
resisted the attack described in [11].

This paper is organized as follows. Section 2 describes the basics of the pro- 
posed attack. Section 3 then describes the method of applying this attack to 
DES and presents the results of our experiment. Section 4 shows the results of 
this attack on AES and Camellia. Lastly, section 5 concludes the paper. 



2 The Basics of Attack 

2.1 Cache Operation 

A cache is a form of memory that allows faster reading and writing of data
than main memory. It is located between the CPU and main memory.
When reading data from main memory, the CPU first checks the cache, and if
the target data is present, it reads the data from the cache. Finding data in the
cache in this way is called a “cache hit,” while not finding data in the cache and
reading it from main memory is called a “cache miss.” In the latter case, the data
read from main memory is also written to the cache¹ so that any subsequent
reading of this data is faster. In short, a delay in processing will occur
even for the same instruction if the target data does not exist in the cache, and this
delay will appear as a variation in the program execution time.

¹ In reality, values near the referenced one will also be loaded into the cache.

Fig. 1. Cipher With Two S-boxes



2.2 The Encryption Time and S-Box Operation 

As described in Section 2.1, a plaintext with a long encryption time should
correspond to a large number of cache misses. In the following, we examine the
conditions for the generation of cache misses in the encryption process.

In the encryption processing, data access occurs when the S-box is referenced.
What then are the conditions that would generate more cache misses when
referencing the S-box? Consider that a cache miss occurs when first referencing
the S-box and that the data in question is therefore loaded into the cache as
described earlier. Now, when next referencing the S-box, if the S-box input value
is the same as the already referenced value or a nearby one,² data referencing
can be done by accessing the cache; a cache miss does not occur. If, however,
a value other than the already referenced ones and their nearby ones is referenced,
the desired data will not be found in the cache and will have to be loaded from
main memory; a cache miss occurs. Accordingly, when making multiple S-box
references during the encryption process, the number of cache misses increases
proportionally with the number of different S-box input values.

Based on the above reasoning, the encryption time should be long if many
different S-box entries are referenced during encryption. Thus, the
measurement of the encryption time for a plaintext makes it possible to determine
whether that plaintext is of the type that generates many cache misses in
encryption (i.e., a plaintext for which there are many different S-box input values).
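The reasoning of this section can be made concrete with a very simple cache model in Python. This is a sketch under strong assumptions that are not part of the paper’s method: the cache is empty at the start, nothing is evicted during one encryption, and the S-box table is aligned to a cache-line boundary. The default sizes (4-byte S-box entries, 32-byte cache lines) are chosen to match the implementation and CPU described in Section 3.2; the function name is illustrative.

```python
def count_cache_misses(sbox_inputs, entry_bytes=4, line_bytes=32):
    """Count misses for a sequence of S-box look-ups in an initially empty
    cache without evictions; entries sharing a cache line count as 'nearby'."""
    entries_per_line = line_bytes // entry_bytes
    cached_lines = set()
    misses = 0
    for x in sbox_inputs:
        line = x // entries_per_line     # cache line holding entry x
        if line not in cached_lines:     # first access to this line: miss
            misses += 1
            cached_lines.add(line)
    return misses

print(count_cache_misses([5, 5, 5, 5]))      # same input every time
print(count_cache_misses([0, 1, 7]))         # different inputs, same line
print(count_cache_misses([0, 8, 16, 24]))    # four different lines
```

In this model the miss count, and hence the expected delay, grows with the number of distinct cache lines touched, which is the quantity the timing measurement is meant to reveal.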



2.3 Attack Model 

The cipher with two S-boxes shown in Fig. 1 is used to explain the basics of the
process of obtaining information on keys by exploiting side-channel information.
The structure shown in the figure employs independent keys $K_0$ and $K_1$ in
different S-boxes.

Referring to Fig. 1, we assume that the relationship between the input values
of the two S-boxes under comparison is known. The key differential value
$K_0 \oplus K_1$ (referred to below as “key difference”) can therefore be inferred from
the values of plaintext $P_0$ and $P_1$, using either of the following relations.

\[ P_0 \oplus K_0 = P_1 \oplus K_1 \;\Rightarrow\; P_0 \oplus P_1 = K_0 \oplus K_1 \tag{1} \]

\[ P_0 \oplus K_0 \neq P_1 \oplus K_1 \;\Rightarrow\; P_0 \oplus P_1 \neq K_0 \oplus K_1 \tag{2} \]

² Values near the referenced value will be simultaneously loaded due to the characteristics
of the CPU.

In other words, if the plaintexts for which S-box input values are frequently
the same or frequently different are collected by measuring the encryption time,
the information on the key differences can be obtained from those plaintexts.
The attack, which consists of two processes, one for collecting cache timing data
and one for obtaining the key differences (both described in Section 3.2), is called a
“cache attack.” Obtaining key differences by a cache attack can reduce the key
search space.

As the structure shown in Fig. 1 can be found in many block ciphers, it is
thought that cache attacks are widely applicable to ciphers of this type.



2.4 Non-elimination/Elimination Table Method 

As described above, a correlation exists between the encryption time and the
relationship between input values of separate S-boxes. We consider the following
two methods of obtaining a key difference, based on such information.

The first method corresponds to the situation in which the input values
of S-boxes under comparison are equivalent. In this case, Eq. (1) holds and
the values for the key differences can be calculated from plaintext information.
Implementing this method requires the collection of plaintexts resulting in a short
encryption time, under the assumption that a plaintext having a small number of
cache misses equals a plaintext having a short encryption time. It can therefore
be guessed that most of the collected plaintexts result in equivalent input values
between the S-boxes in question. Key differences can therefore be calculated for
the collected plaintexts and the value counted most frequently can be regarded as
the correct key difference. We call this method a “non-elimination table attack.”

The second method corresponds to the situation in which the input values of
S-boxes under comparison are different. In this case, Eq. (2) holds and values of
improbable key differences can be excluded. Implementing this method requires
the collection of plaintexts resulting in a long encryption time, under the assumption
that a plaintext having a large number of cache misses equals a plaintext having
a long encryption time. Thus, most of the collected plaintexts are guessed
to result in different input values between the S-boxes. Key differences for the
collected plaintexts can therefore be calculated and the value that appears the
least frequently is taken as the correct key difference. We call this method an
“elimination table attack.”
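Both methods can be illustrated on the two-S-box toy cipher of Fig. 1. The Python sketch below is an idealized model, not the experimental procedure of this paper: it assumes 4-bit S-box inputs, ignores ‘nearby’ values, and assumes that timing classifies every plaintext perfectly as ‘fast’ (equal S-box inputs) or ‘slow’ (different inputs). With real measurements the classification is noisy, which is why the text speaks of the most and least frequent values. All names are illustrative.

```python
from collections import Counter

BITS = 4
N = 1 << BITS   # number of possible S-box inputs in this toy model

def inputs_collide(p0, p1, k0, k1):
    """True if both S-boxes receive the same input (fewer cache misses)."""
    return (p0 ^ k0) == (p1 ^ k1)

def non_elimination(fast_plaintexts):
    """Eq. (1): the most frequent P0 xor P1 is taken as the key difference."""
    votes = Counter(p0 ^ p1 for p0, p1 in fast_plaintexts)
    return votes.most_common(1)[0][0]

def elimination(slow_plaintexts):
    """Eq. (2): the least frequent P0 xor P1 is taken as the key difference."""
    votes = Counter(p0 ^ p1 for p0, p1 in slow_plaintexts)
    return min(range(N), key=lambda diff: votes[diff])

k0, k1 = 0b1010, 0b0110                      # secret keys of the toy cipher
plaintexts = [(p0, p1) for p0 in range(N) for p1 in range(N)]
fast = [pt for pt in plaintexts if inputs_collide(*pt, k0, k1)]
slow = [pt for pt in plaintexts if not inputs_collide(*pt, k0, k1)]

print(non_elimination(fast) == k0 ^ k1)
print(elimination(slow) == k0 ^ k1)
```

In this noise-free setting the correct key difference receives all votes among the fast plaintexts and none among the slow ones, so both methods recover $K_0 \oplus K_1$.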

For DES, the number of S-box operations is 16, a rather small number considering
that each S-box has 64 entries. Therefore, it is predicted that many input
values will be different between the S-boxes, making it easy to collect plaintexts.
Thus, we applied an elimination table attack to DES.

Fig. 2. Whole Structure and Round Function



3 Attack on DES 

3.1 DES Structure 

DES has a 16-round Feistel structure. Each round function features eight S-boxes,
each with a 6-bit input and a 4-bit output. Each S-box is used 16 times, a small
number compared to its $2^6 = 64$ entries. In the key scheduler, 48 bits of the 64-bit
secret key are selected for each round, and their value is used as the expanded key for
the corresponding round. Refer to Fig. 2 and Fig. 3 for details.

As shown in Fig. 3, the total number of left cyclic shifts is 28 bits,
which means that $(C_0, D_0)$ and $(C_{16}, D_{16})$ have the same value. Thus, $(C_1, D_1)$
used in round 1 and $(C_{16}, D_{16})$ used in round 16 are related by a 1-bit
left cyclic shift. This relationship is used for the secret key recovery described in
Section 3.2.
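The relationship between the round-1 and round-16 key registers can be checked directly from the DES shift schedule. The following Python sketch uses the per-round left-shift amounts of the DES standard and an arbitrary 28-bit test value; the helper names are illustrative.

```python
# Left cyclic shift applied to the 28-bit halves C and D in rounds 1..16 (DES standard)
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]

def rotl28(x, n):
    """Rotate a 28-bit value left by n bits."""
    n %= 28
    return ((x << n) | (x >> (28 - n))) & 0xFFFFFFF

def half_key_states(c0):
    """Return [C_0, C_1, ..., C_16] for a 28-bit starting value C_0."""
    states = [c0]
    for n in SHIFTS:
        states.append(rotl28(states[-1], n))
    return states

c = half_key_states(0x1234567)       # arbitrary 28-bit example value
print(sum(SHIFTS))                   # total shift amount
print(c[16] == c[0])                 # C_16 equals C_0
print(c[1] == rotl28(c[16], 1))      # C_1 is C_16 rotated left by one bit
```

The same holds for the $D$ half, since both halves are rotated by the same amounts.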



3.2 Attack Technique 

This section describes the DES attack technique in detail. The steps making up 
this attack are divided into two main stages. Stage 1 is used to collect plaintexts 
for encryption, while Stage 2 is used to obtain key differences from the collected 
plaintexts. These stages are performed independently of each other. 

The experiment described in this paper was carried out in the machine and compiler
environment summarized in Table 1.

Fig. 3. Key Schedule



Collection of Plaintext. We first describe the method of collecting plaintexts,
i.e., Stage 1 of the attack. Here, plaintexts having a long encryption time are needed
to apply the elimination table attack described in Section 2.4. Our approach
therefore is to encrypt a fixed number of randomly generated plaintexts and to
examine the resulting distribution of encryption times. The following method is
used to measure the delay caused by cache misses as accurately as possible. The
characteristics of the CPU (Pentium III) used in this experiment are also taken
into account, and the DES source code that we use is the one described in [10].
This source code assigns 4 bytes to each S-box entry, since
the S-box and bit permutation are computed simultaneously for faster performance.
The encryption time measurement method is as follows.



— Before beginning measurements, S-box data is evicted from the L1 data
cache. In practice, 16 kilobytes of random data are loaded into the
16-kilobyte data area of the L1 data cache to fill it.

— The rdtsc instruction, which loads the value of the processor’s time stamp
counter into a register, is used to measure encryption time; the instruction
is executed directly before and after encryption, and the difference between
the two values gives the encryption time.
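The two measurement steps can be sketched as follows. This is a minimal
illustration in Python, not the code used in the experiment: Python cannot
execute rdtsc or control the L1 data cache, so `time.perf_counter_ns` and a
sweep over a 16-KB buffer merely stand in for them. The names
`collect_slow_plaintexts`, `encrypt`, `keep_fraction` and `timer` are
hypothetical.

```python
import time

CACHE_BYTES = 16 * 1024   # size of the L1 data cache in Table 1
LINE_BYTES = 32           # cache line size in Table 1


def collect_slow_plaintexts(encrypt, plaintexts, keep_fraction,
                            timer=time.perf_counter_ns):
    """Time each encryption and return the slowest share of the samples.

    Each returned item is (elapsed, plaintext, ciphertext), slowest first.
    """
    filler = bytearray(CACHE_BYTES)
    samples = []
    for pt in plaintexts:
        # Stand-in for evicting the S-boxes: touch one byte per cache line.
        for i in range(0, CACHE_BYTES, LINE_BYTES):
            filler[i] = (filler[i] + 1) & 0xFF
        start = timer()            # stand-in for rdtsc before encryption
        ct = encrypt(pt)
        elapsed = timer() - start  # stand-in for rdtsc after encryption
        samples.append((elapsed, pt, ct))
    samples.sort(key=lambda s: s[0], reverse=True)
    keep = max(1, int(len(samples) * keep_fraction))
    return samples[:keep]
```

The `timer` argument is injectable only so that the selection logic can be
exercised with a deterministic clock; a real measurement would need a native
cycle counter.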

The above method makes it possible to measure the encryption time for any
plaintext and to collect the plaintext/ciphertext pairs required for obtaining
key differences.

Table 1. Experimental Environment

PC                       NEC MateNX MA60J
CPU                      Intel Pentium III (Katmai) 600 MHz
L1 data cache            16-KB 4-way set associative cache, 32-byte cache line
L2 cache (size / speed)  512 KB / half (300 MHz)
Bus clock                100 MHz
OS                       Microsoft Windows 2000 SP3
Compiler                 Microsoft Visual C++ 6.0 SP5
Compile option           Maximize Speed (/O2)




Fig. 4. Relationship Between Number of Cache Misses and Encryption Time
(axis label: encryption time)



Relationship between Encryption Time and Cache Misses. We investigated
whether the collected plaintexts actually behave as expected. Fig. 4 shows the
number of cache misses versus encryption time for the randomly generated
plaintexts. A single arbitrary key was used in this experiment to measure the
frequency of cache misses. Fig. 4 shows that the number of cache misses
increases as the encryption time becomes longer.

Fig. 5 shows the relationship between the number of plaintexts and the number
of cache misses, for randomly generated plaintexts and for plaintexts having a
long encryption time. These results confirm that the set of plaintexts having a
long encryption time contains significantly more plaintexts causing many cache
misses than the set of randomly generated plaintexts.

Making an Elimination Table Attack (Obtaining Key Differences).

This part describes the method of obtaining key differences, which is Stage 2
of the attack. For the collected plaintexts, it is assumed that the input value
to each S-box of round 1 differs from the input value to the corresponding
S-box of round 16. Since these inputs are $E(R_0) \oplus K_1$ and
$E(R_{15}) \oplus K_{16}$, Eq. (3) must hold.

$$K_1 \oplus K_{16} \neq E(R_0) \oplus E(R_{15}) \qquad (3)$$

Fig. 5. Relationship Between Number of Plaintexts and Number of Cache Misses
(axis label: number of cache misses)



Based on the concept presented in Section 2.4, the value appearing least
frequently among those obtained as $E(R_0) \oplus E(R_{15})$ is highly likely to
be the correct key difference, provided that enough plaintexts were collected in
Stage 1. We therefore take the least frequent value of
$E(R_0) \oplus E(R_{15})$ as the key difference. This computation is performed
for each pair of S-boxes S1 through S8 in rounds 1 and 16 to obtain eight key
differences.

However, the 2 bits on the LSB (Least Significant Bit) side of each key
difference are indeterminate. This is because a cache miss does not occur if the
input values of the 2 S-boxes under comparison differ by less than the cache
load size; such differences are ignored. Thus, the values adjacent to the value
that would be counted least frequently as a key difference are not counted when
the non-elimination table attack is applied. The same holds for the values
adjacent to the most frequently counted value when the elimination table attack
is applied. A key difference can therefore be obtained, but its bits on the LSB
side remain indeterminate; theoretically, 3 bits are indeterminate. In our
experiment, however, only the 2 bits on the LSB side were found to be
indeterminate, because the S-box addresses were not aligned on a 32-byte
boundary.

Considering the above, when recovering the secret key we assume that the 4 bits
on the MSB (Most Significant Bit) side of each obtained key difference are
correct.
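The counting step can be sketched as follows, assuming the 6-bit values
$E(R_0) \oplus E(R_{15})$ for one S-box have already been extracted from the
collected plaintext/ciphertext pairs. The function name and the `drop_bits`
parameter are hypothetical; the sketch only illustrates picking the least
frequent value of the 4 bits on the MSB side.

```python
from collections import Counter


def least_frequent_difference(xor_values, drop_bits=2):
    """Return the least frequent value of the upper bits of 6-bit inputs."""
    kept = 6 - drop_bits
    # Start every candidate at zero so that values never seen can win.
    counts = Counter({d: 0 for d in range(1 << kept)})
    for v in xor_values:
        counts[(v & 0x3F) >> drop_bits] += 1
    # Ties are broken toward the smaller value (an arbitrary choice).
    return min(counts, key=lambda d: (counts[d], d))
```

Running this once per S-box pair gives the eight 4-bit key differences used in
the key recovery.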



Recovering the Secret Key. The secret key is recovered from the 8 key 
differences obtained in Stage 2 in the following way. 

Step 1. Prepare one plaintext/ciphertext pair by encrypting any plaintext with 
the actual secret key. 

Footnote on the cache load size: The Pentium III processor has a 32-byte cache
load size, i.e., 8 entries are loaded simultaneously if each S-box entry is
declared as 4 bytes.

Table 2. Experimental Results

Number of plaintext/ciphertext   Number substituted   Probability of
pairs used as $2^{n-m}$          for $2^{-m}$         success
$2^{16}$                         $2^{-6}$             68.7%
                                 $2^{-7}$             74.7%
                                 $2^{-8}$             85.0%
$2^{17}$                         $2^{-6}$             90.7%
                                 $2^{-7}$             92.3%
                                 $2^{-8}$             97.0%


Step 2. Guess 1 bit of the expanded key for round 1 and use the obtained key
difference between round 1 and round 16 to determine the corresponding bit
of the expanded key for round 16. In this way, a 32-bit (4 bits x 8 key
differences) exhaustive search on the expanded key for round 1, combined
with the previously obtained key differences, determines the corresponding
bits of the expanded key for round 16. In addition, guessing 1 bit can
determine 1 or more further bits, based on the relationship between the
expanded keys for round 1 and round 16 described in Section 3.1.
Consequently, the secret key is guessed by a 24-bit exhaustive search on the
expanded key for round 1. See the appendix for a detailed description of the
secret-key recovery method.

Step 3. Encrypt the plaintext prepared in Step 1 using the secret key guessed
in Step 2. If the resulting ciphertext agrees with the one obtained in Step 1,
the secret key is correct. If they do not agree, return to Step 2. Note that
if the secret key cannot be recovered by a 24-bit exhaustive search, the key
differences obtained in Stage 2 are wrong.

3.3 Results of Experiment 

Table 2 lists the results of the DES elimination table attack described in
Section 3.2. For the attack, we use $2^{n-m}$ out of $2^{n}$ randomly generated
plaintexts, taken in order of decreasing encryption time. Three values,
$2^{-6}$, $2^{-7}$ and $2^{-8}$, were used as $2^{-m}$ to compare the
probability of success of the attack, while two values, $2^{16}$ and $2^{17}$,
were used as $2^{n-m}$. The experiment was performed using 300 secret keys for
each combination of the two parameters: the number of plaintexts and the value
substituted for $2^{-m}$.

The results in Table 2 show that the secret key is recovered with a probability
> 90% when $2^{17}$ plaintext/ciphertext pairs are collected, and that setting
a stricter condition for collecting plaintexts makes it possible to collect
plaintext/ciphertext pairs having more cache misses.

3.4 Discussion 

The above sections described a technique for breaking DES and the results of
the attacks. Those results, however, depend on the experimental environment
specified in Table 1. Since a cache attack is a type of side-channel attack,
the results are likely to vary significantly with the computer environment. The
result and its efficiency are also expected to vary with the source code. The
following discusses these factors.

A cache attack infers the frequency of cache misses from side-channel
information and uses it to obtain key differences. As a consequence, S-box size
can have a great effect on the attack. The DES source code used in our
experiment declares each entry of an S-box as 4 bytes (referred to below as
int type). For an S-box declared as int type, eight entries are loaded into the
cache per cache load. Thus, since a DES S-box has 64 entries in all, all S-box
data will be loaded into the cache once eight cache misses have occurred. In
contrast, for an S-box declared with 1 byte per entry (referred to below as
char type), 32 entries are loaded into the cache per cache load; only two cache
misses are then needed to load all S-box data. In that case it seems impossible
to infer the frequency of cache misses from the duration of encryption, and
hence to select and collect useful plaintexts. For confirmation, we applied an
experimental attack on DES with the source code described in [10], after
changing only the S-box declaration type from int to char, and found that the
attack failed entirely. However, on a 32-bit processor such as the Pentium III,
int type data is processed faster than char type data. Thus, data that could be
declared as char type will often be declared as int type when implementing
ciphers. This kind of implementation for faster processing can lead to
vulnerability to cache attacks.
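The arithmetic behind the int and char cases can be written as a one-line
helper. This is only a sketch: the function name is hypothetical, and it assumes
the S-box starts on a cache-line boundary, which the text notes was not the case
in the experiment.

```python
def sbox_cache_lines(entries, entry_bytes, line_bytes=32):
    """Number of cache lines spanned by an S-box that starts on a line boundary."""
    return (entries * entry_bytes + line_bytes - 1) // line_bytes
```

For a 64-entry DES S-box this gives 8 lines with 4-byte entries and 2 lines with
1-byte entries, so the char declaration leaves far fewer distinguishable
cache-miss counts.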

3.5 Attack on Triple-DES 

In this section, we consider whether the above cache attack on DES can be made 
on Triple-DES. Triple-DES performs a DES process three times in the form of 

• Encryption - Decryption - Encryption, or 

• Encryption - Encryption - Encryption. 

In addition, there are three ways of using keys, as follows. 

(a) K1 - K2 - K3 

(b) K1 - K2 - K1 

(c) K1 - K1 - K1 

Repeating DES three times in this manner provides greater resistance to
cryptanalysis techniques such as differential and linear cryptanalysis, which
exploit the correlation of round functions. At the same time, secret key
variations (a) and (b) have a longer key length than DES, making an exhaustive
key search all the more difficult.

In any of the above Triple-DES variations, an S-box operates 48 times, three
times as many as in DES. Still, if the operation delay due to cache misses can
be measured, it should be possible to mount a cache attack against Triple-DES
in the same way as against DES.

Table 3. The type of S-box of each cipher and the technique of applying the
cache attack. Ssize represents the number of S-box entries, while Snum stands
for the number of times that an S-box look-up is performed. Smiss represents
the maximum possible number of cache misses caused by S-box look-ups. Technique
shows the combination of the type of plaintexts used for cryptanalysis and the
type of attack applied

Cipher        Ssize  Snum  Smiss  Technique
DES           64     16    8      plaintexts with long encryption time
                                  and elimination table attack
MISTY1 (S9)   512    48    64
Camellia      256    36    32
AES           256    160   8      plaintexts with long encryption time
                                  and non-elimination table attack


We expect that, similarly to DES, the key difference between round 1 and
round 48 can be determined. For cryptanalysis on an actual computer, we can
expect the 2 bits on the LSB side of each key difference to be indeterminate, so
that the actually computed key difference consists of 4 bits x 8 = 32 bits.
Thus, for secret key variation (a), which has a key length of 168 bits, 32 bits
of K3 can first be determined by guessing 32 bits of K1. Then, if an exhaustive
key search is performed on the remaining 104 bits (= 168 - 64), it should be
possible to break the cipher in $2^{136}$ calculations. This concept also holds
for the other key variations; that is, it should be possible to break
Triple-DES more efficiently than by an exhaustive key search.
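The count for key variation (a) can be written out as follows, assuming each
guess of the 32 bits of K1 is combined with a full search over the remaining
104 bits:

```latex
\underbrace{2^{32}}_{\text{guess on } K1} \cdot
\underbrace{2^{104}}_{\text{remaining key bits}} = 2^{136} < 2^{168}
```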



4 Other Ciphers 

We made experimental cache attacks on AES and Camellia. Based on the results 
of the attacks, this section discusses the relationship between the number of the 
times that S-box look-up is performed and the cache attack. 

4.1 Results of the Experiment 

Table 3 shows the type of S-box and the technique of applying the cache attack
for each cipher. Information on DES and MISTY1 is also given in the table for
comparison. The following outlines the technique of applying the cache attack
to Camellia and AES.



Camellia. The source code is first modified using the speed-up techniques
recommended by the designers and described in the specification [3]. For each
of the four S-boxes declared by the speed-up techniques, the frequency of cache
misses is directly proportional to the encryption time, as observed for DES.
Using this property and 2^® plaintexts with long encryption time yields 168-bit
key differences concerning a 256-bit equivalent key composed of the subkeys of
round 1 through round 4 and the subkeys used in the initial processing. Using
the obtained key differences, approximately 2^^ computations recover the
secret key.

Fig. 6. Correlation Between the Encryption Time and the Average Number of Cache
Misses (on AES). This graph also represents the encryption time distribution of
plaintexts

AES. We employed the source code available in [7]. No correlation is found
between the frequency of cache misses and the encryption time (see the plot
labelled “All rounds” in Fig. 6). However, a study of the 16 S-boxes used at the
beginning of the algorithm showed a correlation: a lower frequency of cache
misses implies a longer encryption time (see the plot labelled “Round 1” in
Fig. 6). This property provides 96-bit key differences by collecting 2^®
plaintexts with long encryption time and regarding the most frequently counted
value as the correct key difference. A 32-bit brute-force search using these
key differences then recovers the secret key.

4.2 Discussion 

According to the paper [8] by Ohkuma et al., the cache attack is theoretically
feasible even if the number of S-box look-ups is fairly large. The following
equation represents the probability that the number of cache misses is n, where
N and M stand for the number of S-box inputs and the number of times that an
S-box look-up is performed, respectively.
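As a rough illustration, and not necessarily the exact expression of [8], such
a probability can be computed under the assumption that the M look-ups are
independent and uniformly distributed over N equally likely values, with one
miss per distinct value touched. The function name is hypothetical.

```python
from math import comb


def distinct_count_pmf(n_values, lookups):
    """P(exactly n distinct values are hit) for n = 0 .. n_values."""
    total = n_values ** lookups
    pmf = []
    for n in range(n_values + 1):
        # Number of ways the look-ups cover n fixed values exactly
        # (inclusion-exclusion over the values left untouched).
        onto = sum((-1) ** k * comb(n, k) * (n - k) ** lookups
                   for k in range(n + 1))
        pmf.append(comb(n_values, n) * onto / total)
    return pmf
```

For example, `distinct_count_pmf(8, 16)` models 16 look-ups into a table that
spans 8 cache lines.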

This equation also indicates that it is theoretically feasible to break a
cipher if the frequency of cache misses can vary depending on the collected
plaintexts.

An accurate difference between cache miss frequencies is hard to obtain when
the attack is applied on a practical computer. When a cipher (e.g. AES) does
not show a significant difference in the frequency of cache misses, regardless
of the plaintexts used for the attack, it is hard to perform cryptanalysis
using the frequency of cache misses and the encryption time. However, we broke
such a cipher by utilizing the correlation between the encryption time and the
probability that the values of some of the S-box inputs are identical. When a
cipher performs fewer S-box look-ups and shows significant variations in the
frequency of cache misses, like DES, we can expect the frequency of cache
misses and the encryption time to correlate with each other. The correlation
between the encryption time and the probability that the values of some of the
S-box inputs are identical, which is used for the cryptanalysis of AES,
however, varies significantly depending on the type of CPU and the way the
source code is implemented. For example, the encryption times of two sets of
plaintexts with the same total frequency of cache misses on an Intel Pentium
III processor sometimes differ, depending on whether or not the cache misses
occur consecutively at the beginning of the encryption. In addition, which core
of the Intel Pentium III processor is used, Coppermine or Katmai, decides which
S-box to use for cryptanalysis and which attack to apply, the non-elimination
table attack or the elimination table attack. In this case, the cipher can be
broken if we obtain the source code of the target cipher in advance and find
the S-box inputs whose probability of being identical correlates with the
encryption time.



5 Conclusion 

We have shown that the Data Encryption Standard (DES) can be broken with
$2^{23}$ known plaintexts and $2^{24}$ calculations at a success rate > 90%,
using a personal computer with a 600-MHz Pentium III. We have also shown that a
cache attack can be made against a cipher using S-boxes of different
input/output widths or S-boxes of several types. Furthermore, in applying this
cache attack to Triple-DES, it was found that there is a high possibility of it
being broken more efficiently than by an exhaustive key search.

This paper reports the application of a cache attack using a personal computer.
In 2002, cache-based cryptanalysis [9] was proposed, in which cache hits and/or
cache misses are observed by means of electric power or magnetic force. Since
the next generation of 32-bit smartcards will use cache memories, the
combination of the cache attack we proposed and Power Analysis attacks could
probably be a more effective cryptanalysis technique.

We also consider countermeasures against cache attacks. A cache attack infers
the number of cache misses that occurred by observing the encryption time.
Thus, if all table data is loaded before processing, differences between the
frequencies of cache misses will not be observed, making it impossible to
determine the relationships between sets of S-boxes. If it is possible to clear
the cache during the encryption, generating noise that is unrelated to the
encryption at random time intervals is an effective countermeasure against
cache attacks.

Cache attacks are a newer technique than the timing attacks on RSA. Their
efficiency may be improved by future studies.

Appendix: Secret-Key Recovering Method

The following describes the process for recovering a secret key. As described
in the body of this paper, the following precondition must be satisfied to
recover a secret key.

— The 4 bits on the MSB side of each key difference between the S1 through S8
S-boxes of round 1 and round 16 can be obtained.

In the following, the nth bit from the MSB side of a variable X is denoted X[n].
We take the expanded key $K_1$ of round 1 as an example:

$$K_1 = K_1[1] \,\|\, K_1[2] \,\|\, \cdots \,\|\, K_1[48]$$

Next, based on the key-schedule structure, the relationship between the
computed key differences and the secret-key information C and D can be
represented in the following way.

$$K_1 \oplus K_{16} = PC2(C_1) \oplus PC2(C_{16}) = PC2(C_1) \oplus PC2(LS(C_1)) \qquad (4)$$

Using Eq. (4), the $C_1$ and $D_1$ information can be computed step by step.
The following 16 equations are Eq. (4) expressed bit by bit.

$$K_1[7] \oplus K_{16}[7] = C_1[3] \oplus C_1[4]$$
$$K_1[16] \oplus K_{16}[16] = C_1[4] \oplus C_1[5]$$
$$K_1[10] \oplus K_{16}[10] = C_1[6] \oplus C_1[7]$$
$$K_1[20] \oplus K_{16}[20] = C_1[7] \oplus C_1[8]$$
$$K_1[3] \oplus K_{16}[3] = C_1[11] \oplus C_1[12]$$
$$K_1[15] \oplus K_{16}[15] = C_1[12] \oplus C_1[13]$$
$$K_1[1] \oplus K_{16}[1] = C_1[14] \oplus C_1[15]$$
$$K_1[9] \oplus K_{16}[9] = C_1[15] \oplus C_1[16]$$
$$K_1[19] \oplus K_{16}[19] = C_1[16] \oplus C_1[17]$$
$$K_1[2] \oplus K_{16}[2] = C_1[17] \oplus C_1[18]$$
$$K_1[14] \oplus K_{16}[14] = C_1[19] \oplus C_1[20]$$
$$K_1[22] \oplus K_{16}[22] = C_1[20] \oplus C_1[21]$$
$$K_1[13] \oplus K_{16}[13] = C_1[23] \oplus C_1[24]$$
$$K_1[4] \oplus K_{16}[4] = C_1[24] \oplus C_1[25]$$
$$K_1[21] \oplus K_{16}[21] = C_1[27] \oplus C_1[28]$$
$$K_1[8] \oplus K_{16}[8] = C_1[28] \oplus C_1[1]$$

Using the 16 equations above, guessing the 7 bits $C_1[3]$, $C_1[6]$,
$C_1[11]$, $C_1[14]$, $C_1[19]$, $C_1[23]$, and $C_1[27]$ yields the 16 bits
$C_1[4]$, $C_1[5]$, $C_1[7]$, $C_1[8]$, $C_1[12]$, $C_1[13]$, $C_1[15]$,
$C_1[16]$, $C_1[17]$, $C_1[18]$, $C_1[20]$, $C_1[21]$, $C_1[24]$, $C_1[25]$,
$C_1[28]$, and $C_1[1]$. All 28 bits of $C_1$ are obtained by guessing 7 bits
in the way described above and then guessing the remaining 5 bits of $C_1$:
$C_1[2]$, $C_1[9]$, $C_1[10]$, $C_1[22]$, and $C_1[26]$.

$D_1$ is treated similarly. A 12-bit exhaustive search on $D_1[2]$, $D_1[5]$,
$D_1[6]$, $D_1[7]$, $D_1[8]$, $D_1[11]$, $D_1[16]$, $D_1[20]$, $D_1[21]$,
$D_1[26]$, $D_1[27]$, and $D_1[28]$ determines the 28 bits of $D_1$.
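The chain of deductions for the C half can be sketched as follows. This is an
illustration only: the bit positions are copied from the 16 bit-level equations
above, while the function and variable names are hypothetical. Positions are
1-based, as in the text.

```python
# (i, j, k) encodes K1[k] xor K16[k] = C1[i] xor C1[j]. The order lets each
# equation derive C1[j] from a bit C1[i] that is already known.
C_EQUATIONS = [
    (3, 4, 7), (4, 5, 16), (6, 7, 10), (7, 8, 20),
    (11, 12, 3), (12, 13, 15), (14, 15, 1), (15, 16, 9),
    (16, 17, 19), (17, 18, 2), (19, 20, 14), (20, 21, 22),
    (23, 24, 13), (24, 25, 4), (27, 28, 21), (28, 1, 8),
]
# The 7 chain heads, followed by the 5 bits not covered by any equation.
C_GUESSED = [3, 6, 11, 14, 19, 23, 27, 2, 9, 10, 22, 26]


def recover_c1(key_diff, guess):
    """Rebuild the 28 bits of C1.

    key_diff maps k to K1[k] xor K16[k]; guess maps each position in
    C_GUESSED to a guessed bit.
    """
    c1 = {p: guess[p] for p in C_GUESSED}
    for i, j, k in C_EQUATIONS:
        c1[j] = c1[i] ^ key_diff[k]
    return [c1[p] for p in range(1, 29)]
```

In the attack, `guess` would range over all $2^{12}$ assignments, and each
resulting candidate would be checked by the trial encryption of Step 3.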

Overall, the above uniquely recovers the 56 bits of the secret key by guessing
24 bits.

Abstract. In this paper we describe a differential fault attack technique
working against Substitution-Permutation Networks and requiring very
few faulty ciphertexts. The fault model used is realistic, as we consider
random faults affecting bytes (faults affecting only one bit are much
harder to induce). We implemented our attack on a PC for both the
AES and Khazad. We are able to break the AES-128 with only 2 faulty
ciphertexts, assuming the fault occurs between the antepenultimate
and the penultimate MixColumn; this is better than the previous fault
attacks against AES [6,10,11]. Under similar hypotheses, Khazad is
breakable with 3 faulty ciphertexts.

Keywords: AES, Block Ciphers, Fault Attacks, Side-channel Attacks



1 Introduction 

The idea of using hardware faults happening during the execution of a
cryptographic algorithm for breaking it (typically, for retrieving the key) was
first suggested in 1997 by D. Boneh, R.A. DeMillo, and R.J. Lipton [7,8]. They
succeeded in breaking RSA-CRT with both a correct and a faulty signature of
the same message. Shortly after, an adaptation of this idea to block ciphers
was proposed by E. Biham and A. Shamir [5].

Application of this principle to tamper-resistant devices such as smart cards
is a real threat (see e.g. [1,2]): by changing the power supply voltage or the
frequency of the external clock, or by applying radiation, a fault can be
induced with some probability during the computation. The faults induced by
most of these techniques affect one byte^1, as it is the size of a register in
current smart cards; however, it is often the case that the attacker cannot
locate a priori at which stage of the algorithm the fault occurred.

Several authors mounted differential fault attacks against the AES [6,10,11].
In this paper we present a fault attack working against any block cipher with
a Substitution-Permutation structure^2. More precisely, its round function must
have the form $\sigma[K^r] \circ \theta_r \circ \gamma_r$ (r is the round
number), where:

— The $\gamma_r$ layer consists of the parallel application of n 8 x 8 S-boxes
(not necessarily identical).

— $\sigma[k]$ denotes the key addition layer:

$$\sigma[k](a) = b \iff b_j = a_j \oplus k_j, \quad 1 \le j \le n$$

where $\oplus$ denotes a group operation. As it is often exclusive or, in the
following we will only deal with this case. But our attack could also work
against other group operations.

— The diffusion layer $\theta_r$ is a mapping that is linear with respect to
$\oplus$.

— $K^r$ denotes the round key.

The block size is 8n bits. Note that 8 x 8 S-boxes are by no means mandatory
for our attack; we restrict ourselves to this parameter as it is a common
choice, well suited to implementation considerations. 4 x 4 and 2 x 2 S-boxes
can be viewed as 8 x 8 S-boxes as well, by considering groups of 2 (resp. 4)
of them.

The last round of the cipher has the special form
$\sigma[K^R] \circ \gamma_R$, as a $\theta$ layer at this stage would have no
cryptographic significance. Moreover, the first round is preceded by a key
addition layer. Thus the whole cipher can be described as:

$$\sigma[K^R] \circ \gamma_R \circ \left( \bigcirc_{r=1}^{R-1} \sigma[K^r] \circ \theta_r \circ \gamma_r \right) \circ \sigma[K^0]$$

Note that the $\gamma_r$ and $\theta_r$ layers need not be identical for all
rounds. Only the last two rounds are important for our attack. They are
depicted in Fig. 1.

^1 Although progress was recently made in inducing faults affecting only one
bit [13].



2 The Attack 

The Fault Model. We are dealing with random faults, in the sense that the
faulty value is random and is assumed to be uniformly distributed. Faults are
assumed to occur on one byte. Moreover, we assume that the fault occurred
somewhere between the penultimate diffusion layer $\theta_{R-2}$ and the last
diffusion layer $\theta_{R-1}$ (i.e. somewhere inside the frame of Fig. 1).
Under this condition, the exact stage of the computation at which the fault
occurred has no importance, and cannot be guessed by observing the ciphertexts
either.

In the remainder of this paper, (C; C*) always denotes a right ciphertext C
and its corresponding faulty ciphertext C*. Also, unless otherwise stated,
indices refer to byte positions (for example, $C_1$ denotes the left-most byte
of C).

^2 Strictly speaking, the designation substitution-permutation network implies
that the diffusion layer is a bit permutation. However, it is increasingly used
to refer also to ciphers with a more complex diffusion layer, and we follow
this usage.

Fig. 1. Last 2.5 rounds of a Substitution-Permutation Network.



Basic Attack. Consider 1-byte differences at the input of the linear layer
$\theta_{R-1}$. There are 255n such differences (n possible locations, and 255
possible values). Because of the linearity, the number of corresponding
possible differences at its output is also 255n; but while the input difference
affects one byte only, the output difference affects several bytes because of
the diffusion (if the diffusion layer is optimal, all bytes are affected; with
some slower diffusion layers only a few of them may be affected). Note that the
key addition $\sigma[K^{R-1}]$ does not change the set of possible differences.

These considerations lead to a first sketch of the attack. For simplicity we
assume that the $\theta_{R-1}$ layer achieves optimal diffusion.

1. Compute the 255n possible differences at the output of $\theta_{R-1}$, i.e.
the 255n values $\theta_{R-1}(x)$, where x has a byte Hamming weight of 1.
Store them in a list $\mathcal{D}$.

2. Consider a plaintext P, its corresponding ciphertext C, and the faulty
ciphertext C*.

3. Take a guess on the round key $K^R$.

4. Compute the difference
$\gamma_R^{-1} \circ \sigma[K^R](C) \oplus \gamma_R^{-1} \circ \sigma[K^R](C^*)$.
Check whether it is in $\mathcal{D}$. If yes, add the round key to the list
$\mathcal{L}$ of possible candidates.

5. Consider a new plaintext P (with corresponding C and C*) and go back to
step 2 (this time round key guesses only go through the list $\mathcal{L}$ of
possible candidates; if the difference computed at step 4 is not in
$\mathcal{D}$, remove the candidate from $\mathcal{L}$). Repeat until only one
candidate remains in $\mathcal{L}$.

If the diffusion layer is slow, only a limited number of bytes of the
ciphertext are affected by a given fault. Thus each pair (C; C*) gives
information only on a subset of the round key bytes; the guess is made only on
these bytes. The AES is a good example of this fact.
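The basic attack can be sketched on a toy instance. Everything in this example
is an assumption chosen for illustration: the block has n = 2 bytes, the S-box
is an arbitrary fixed permutation, and the diffusion layer is the 2 x 2 matrix
[[1, 1], [1, 2]] over GF(2^8). None of it is taken from a real cipher, and the
function names are hypothetical.

```python
import random


def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo the polynomial 0x11B."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def theta(a, b):
    """Toy linear layer on 2 bytes; a 1-byte difference affects both bytes."""
    return (a ^ b, a ^ xtime(b))


# Toy 8 x 8 S-box: an arbitrary fixed permutation, not a real cipher's S-box.
SBOX = list(range(256))
random.Random(2003).shuffle(SBOX)
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x

# Step 1: the 255n output differences of theta for 1-byte input differences.
DIFFS = ({theta(e, 0) for e in range(1, 256)}
         | {theta(0, e) for e in range(1, 256)})


def last_rounds(x, fault, k_prev, k_last):
    """Apply the fault, theta, a key addition, the S-boxes and the last key."""
    y = theta(x[0] ^ fault[0], x[1] ^ fault[1])
    return tuple(SBOX[y[i] ^ k_prev[i]] ^ k_last[i] for i in range(2))


def surviving_keys(pairs):
    """Steps 2 to 5: keep the last-round keys consistent with every pair."""
    keys = [(k0, k1) for k0 in range(256) for k1 in range(256)]
    for c, c_star in pairs:
        keys = [k for k in keys
                if tuple(INV_SBOX[c[i] ^ k[i]] ^ INV_SBOX[c_star[i] ^ k[i]]
                         for i in range(2)) in DIFFS]
    return keys
```

The correct last-round key always survives, because for it the computed
difference is exactly the image of the injected 1-byte fault under the linear
layer. How many wrong keys remain depends on the S-box and on the pairs used.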

After the last round key has been found, and if it is not sufficient to
retrieve the whole key, the last round is peeled off, and the attack is
repeated on the reduced cipher.

Complexity Analysis. We compute the fraction of the round keys that are
suggested by a single pair (C; C*) with difference $\Delta = C \oplus C^*$.
Suppose the number of possible differences $\Delta'$ before $\gamma_R$ is
$255^n$. Among these, 255n are elements of $\mathcal{D}$. Thus the fraction of
the keys surviving the test is $255n/255^n = n \cdot 255^{1-n}$.

However, this computation does not take into account the fact that the number
of differences $\Delta'$ before $\gamma_R$ that can cause difference $\Delta$
after it is far less than $255^n$; this is due to the fact that the XOR
distribution table^3 of each S-box contains a lot of 0's. We therefore make the
hypothesis that, for the observed $\Delta$, the fraction of elements of
$\mathcal{D}$ among the corresponding possible $\Delta'$ is also about
$255n/255^n$.

We conclude that the number of remaining wrong candidates for $K^R$ after
N pairs (C; C*) have been treated is about $256^n (n \cdot 255^{1-n})^N$. The
conclusion (for all practical values of n) is that one pair (C; C*) is not
enough to retrieve $K^R$, but two are (still under the hypothesis that the
diffusion layer is optimal; see the AES case in Section 3 for an example where
it is not).
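The estimate can be evaluated numerically. The helper below, whose name is
hypothetical, returns the base-2 logarithm of the expected number of wrong
candidates, taking the expression $256^n (n \cdot 255^{1-n})^N$ at face value.

```python
from math import log2


def log2_wrong_candidates(n, num_pairs):
    """log2 of 256**n * (n * 255**(1 - n))**num_pairs."""
    return 8 * n + num_pairs * (log2(n) + (1 - n) * log2(255))
```

For n = 16, one pair leaves roughly $2^{12}$ wrong candidates, while two pairs
push the estimate far below 1, in line with the conclusion above.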



A Practical Attack. As presented, this attack is not really practical, as it
requires a guess on the whole last round key, that is to say a complexity of
about $2^{8n}$. We show that slightly modifying the attack considerably reduces
this complexity. Once again, for simplicity we assume the diffusion layer
considered is optimal. A similar technique, applied only to the bytes affected
by the fault, can be used when it is not.

1. Compute the 255n possible differences at the output of $\theta_{R-1}$. Store
them in a list $\mathcal{D}$.

2. Consider 2 right ciphertext/faulty ciphertext pairs (C; C*) and (D; D*).

3. Consider the two left-most bytes of $K^R$:

— For each of the $2^{16}$ candidates, compute^4:

$$\gamma_R^{-1} \circ \sigma[(K_1^R, K_2^R)]((C_1, C_2)) \oplus \gamma_R^{-1} \circ \sigma[(K_1^R, K_2^R)]((C_1^*, C_2^*))$$

and

$$\gamma_R^{-1} \circ \sigma[(K_1^R, K_2^R)]((D_1, D_2)) \oplus \gamma_R^{-1} \circ \sigma[(K_1^R, K_2^R)]((D_1^*, D_2^*))$$

— Compare the results with the two left-most bytes of the 255n differences
in list $\mathcal{D}$. Make a list $\mathcal{L}$ of the $(K_1^R, K_2^R)$ for
which a match is found for both ciphertext pairs.

4. For each $K^* \in \mathcal{L}$, try to extend it by one byte:

— Remove $K^*$ from $\mathcal{L}$.

— For all $2^8$ values of $K_3^R$, compute:

$$\gamma_R^{-1} \circ \sigma[(K_2^*, K_3^R)]((C_2, C_3)) \oplus \gamma_R^{-1} \circ \sigma[(K_2^*, K_3^R)]((C_2^*, C_3^*))$$

and

$$\gamma_R^{-1} \circ \sigma[(K_2^*, K_3^R)]((D_2, D_3)) \oplus \gamma_R^{-1} \circ \sigma[(K_2^*, K_3^R)]((D_2^*, D_3^*))$$

— Compare the differences obtained with bytes 2 and 3 of the 255n differences
in $\mathcal{D}$. If a match is found (again for both ciphertext pairs), add
the newly extended key $(K^*, K_3^R)$ to $\mathcal{L}$.

5. Repeat step 4 until the elements of $\mathcal{L}$ have a length of n bytes.

6. Now apply the first algorithm we gave, using the same pairs (C; C*) and
(D; D*), but consider only the candidates in $\mathcal{L}$ (their number is
much smaller than $2^{8n}$).

^3 See [4] for a definition of this concept.

^4 We commit a small abuse of notation by applying $\sigma$ and
$\gamma_R^{-1}$ to data of improper length. The right way to understand this is
to think that e.g. $(C_1, C_2)$ has been right-appended with 0's, and that only
the 2 left-most bytes of the output are considered.

The idea of this algorithm is that its first 5 steps compute a set of
candidates of which the candidates selected by the first algorithm are a
subset; in other words, every candidate obtained by applying the first
algorithm to pairs (C; C*) and (D; D*) will also be returned by steps 1 to 5
of the second algorithm, but the converse is not true. Thus, the first 5 steps
of the second algorithm (which have a low complexity) perform a “first
sorting” of the candidates. After that, the set of candidates is quite small,
and is affordable for the first algorithm.

Faults Occurring at a Wrong Location. As the attacker usually has no control
over the fault location, it is important to be able to distinguish pairs
(C; C*) resulting from faults occurring between $\theta_{R-2}$ and
$\theta_{R-1}$ (we call such pairs right pairs) from other pairs (these are
called bad pairs). This is trivial in the case of diffusion layers for which a
1-byte difference in the input implies an output difference affecting only some
bytes of the output: in this case it is enough to observe whether some bytes
are identical in both C and C*.

But in the case of optimal diffusion layers, it is not possible to decide
whether a single pair (C; C*) is a right or a bad one. However, applying our
attack to 2 pairs (C; C*), one of which is bad, will very probably result in no
solution for the key $K^R$. Thus we can indeed distinguish bad pairs (C; C*)
from right ones, but only by considering pairs of ciphertext pairs (C; C*).
Nevertheless, the attack should be practical: if we consider that 1 ciphertext
pair out of 100 is right, which is more than reasonable, we have about 10000
pairs of ciphertext pairs to examine before finding two right pairs, which is
still feasible.

3 Application to the AES 

3.1 Overview of the AES Structure 

The AES [9] is an example of a substitution-permutation structure, as defined
in the introduction. Both its key and block size can be 128, 192, or 256 bits.
In this paper we will only deal with the 128-bit block, 128-bit key variant, as
it is the most widely used. Our attack can be extended trivially to other
variants.

Table 1. The State during AES encryption

$a_{0,0}$  $a_{0,1}$  $a_{0,2}$  $a_{0,3}$
$a_{1,0}$  $a_{1,1}$  $a_{1,2}$  $a_{1,3}$
$a_{2,0}$  $a_{2,1}$  $a_{2,2}$  $a_{2,3}$
$a_{3,0}$  $a_{3,1}$  $a_{3,2}$  $a_{3,3}$

The key addition is performed using exclusive or. The $\gamma$ layer (identical
for all rounds) consists of the application of 16 identical 8 x 8 S-boxes. The
intermediate computation result, called the state, is usually represented by a
4 x 4 square, each cell of which is a byte (see Table 1); the $\theta$ layer
(identical for all rounds) is the composition of two transformations of the
state:

1. First, the ShiftRow transformation cyclically shifts the rows of the state.
Row 0 is not shifted, row 1 is shifted by 1 byte, row 2 by 2 bytes, and row 3
by 3 bytes. It is pictured in Table 2.

2. Then, the MixColumn transformation applies a linear transformation with
optimal byte branch number (i.e. 5) to each column of the state. More
precisely, application of MixColumn to the first column of the state (for
example) is computed by:

$$
\begin{pmatrix} b_{0,0} \\ b_{1,0} \\ b_{2,0} \\ b_{3,0} \end{pmatrix}
=
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\begin{pmatrix} a_{0,0} \\ a_{1,0} \\ a_{2,0} \\ a_{3,0} \end{pmatrix}
$$

where multiplication is performed in GF($2^8$) (via definition of an irreducible
polynomial of degree 8 over GF(2), see [9] for details).



Table 2. The ShiftRow transformation

    a_{0,0}  a_{0,1}  a_{0,2}  a_{0,3}          a_{0,0}  a_{0,1}  a_{0,2}  a_{0,3}
    a_{1,0}  a_{1,1}  a_{1,2}  a_{1,3}   --->   a_{1,1}  a_{1,2}  a_{1,3}  a_{1,0}
    a_{2,0}  a_{2,1}  a_{2,2}  a_{2,3}          a_{2,2}  a_{2,3}  a_{2,0}  a_{2,1}
    a_{3,0}  a_{3,1}  a_{3,2}  a_{3,3}          a_{3,3}  a_{3,0}  a_{3,1}  a_{3,2}



We observe that a 1-byte difference in the state before the $\theta$ layer results
in a 4-byte difference after it. This property is important for our attack.
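The 1-byte to 4-byte diffusion can be checked with a minimal Python sketch. It assumes `state[i][j]` holds byte a_{i,j} as in Table 1 and uses the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 from [9]; the function names are illustrative and do not come from the paper.

```python
def xtime(b):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

# MixColumn matrix, row by row, as in the equation above.
MIX = [[2, 3, 1, 1],
       [1, 2, 3, 1],
       [1, 1, 2, 3],
       [3, 1, 1, 2]]

def mix_column(col):
    """Apply the MixColumn matrix to one 4-byte column."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gmul(coeff, byte)
        out.append(acc)
    return out

def shift_rows(state):
    """Rotate row i of the state left by i bytes."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]

def theta(state):
    """ShiftRow followed by MixColumn on every column."""
    shifted = shift_rows(state)
    cols = [mix_column([shifted[i][j] for i in range(4)]) for j in range(4)]
    return [[cols[j][i] for j in range(4)] for i in range(4)]

def diff_positions(s1, s2):
    """Positions (i, j) at which two states differ."""
    return [(i, j) for i in range(4) for j in range(4) if s1[i][j] != s2[i][j]]

clean = [[0] * 4 for _ in range(4)]
faulty = [row[:] for row in clean]
faulty[1][1] ^= 0x5A                      # arbitrary 1-byte fault on a_{1,1}
print(diff_positions(theta(clean), theta(faulty)))
```

Because every coefficient of the matrix is non-zero, a single faulty byte in a column always disturbs all four output bytes of that column.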

Note also that the last round has no MixColumn, but does have a ShiftRow. The
reason behind this is purely implementation-related. This last ShiftRow has no
cryptographic significance, and is not relevant to our attack either.

3.2 Previous Works about Fault Analysis on the AES

Several papers have been written on the subject. We summarize their
contributions here, in chronological order:

The first paper we know of is that of Blömer and Seifert [6]. It presents two
main attacks:

- The first one uses a very restrictive fault model: namely, it assumes that one
can force the value of a chosen bit to 0. It is worth noting that applying this
technique to the memory cells storing the key makes it trivial to retrieve,
even without being able to choose precisely the location of the bit set to 0.
This has been demonstrated by Biham and Shamir [5], and is true for any
algorithm. [6] shows however that the fault model can be slightly relaxed.
128 faulty encryptions of plaintext 0 are required to retrieve the key using
this technique.

- The second attack is implementation-dependent, and has several variants
depending on the implementation. Its principle is to turn the timing attack
on AES suggested by Koeune and Quisquater [12] into a fault-based
cryptanalysis. The fault model used also depends on the implementation. The
authors claim that about 16 faulty ciphertexts (with the fault occurring at
a carefully chosen location) are needed to retrieve one key byte.

In [11], Giraud presents two fault attacks on the AES. Both require the
ability to obtain several faulty ciphertexts originating from the same plaintext
(contrary to our attack):

- The first one assumes it is possible to induce a fault on only one bit of an
intermediate result. More precisely, it exploits faults induced on one bit before
the last $\gamma$ layer (while we exploit faults occurring before the last
diffusion layer). Under these conditions, about 50 faulty ciphertexts are
necessary to retrieve the full key (provided the location of the fault can be
chosen).

- The second attack, more realistic, exploits faults on bytes. It requires the
ability to induce faults at several chosen places, including the key schedule.
The author claims that 250 faulty ciphertexts are needed (it is assumed that
the attacker can choose the stage of the computation where the fault takes
place, but not the exact byte), and that the time needed for computation is
about 5 days.

Finally, P. Dusart, G. Letourneux, and O. Vivolo [10] take advantage of byte
faults occurring after the ShiftRow layer of the 9th round. Thus the fault model
and the hypothesis on the fault location are exactly the same as in our attack.
However the way they exploit faults differs from ours: they use the particular
form of the MixColumn transformation and of the AES S-box to write and solve
a system of equations (one per S-box) whose unknown is the value of the fault
(i.e. the byte difference engendered by the fault). Suggestions for 4 key bytes
follow. The authors show that 5 well-located faults are necessary to retrieve
4 key bytes.

3.3 Our Results

As observed above, a 1-byte difference at the input of the $\theta$ layer of AES
results in a 4-byte difference at its output. Concretely, it means that a fault on
one byte before the $\theta^{R-1}$ layer will give information on only 4 bytes of
the last round key $K^R$ (the other bytes of both ciphertexts $C$ and $C^*$ being
identical). More precisely, with the different bytes of the state numbered as in
Table 1:

- A fault on byte $a_{0,0}$, $a_{1,1}$, $a_{2,2}$, or $a_{3,3}$ will release
information about round key bytes $K^R_{0,0}$, $K^R_{1,3}$, $K^R_{2,2}$, $K^R_{3,1}$.

- A fault on byte $a_{0,1}$, $a_{1,2}$, $a_{2,3}$, or $a_{3,0}$ will release
information about round key bytes $K^R_{0,1}$, $K^R_{1,0}$, $K^R_{2,3}$, $K^R_{3,2}$.

- A fault on byte $a_{0,2}$, $a_{1,3}$, $a_{2,0}$, or $a_{3,1}$ will release
information about round key bytes $K^R_{0,2}$, $K^R_{1,1}$, $K^R_{2,0}$, $K^R_{3,3}$.

- A fault on byte $a_{0,3}$, $a_{1,0}$, $a_{2,1}$, or $a_{3,2}$ will release
information about round key bytes $K^R_{0,3}$, $K^R_{1,2}$, $K^R_{2,1}$, $K^R_{3,0}$.
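The correspondence between fault position and key bytes follows mechanically from the ShiftRow offsets. A short Python sketch, assuming the row/column indexing of Table 1 and a final round made of S-boxes, ShiftRow and key addition (the function name is hypothetical):

```python
def affected_key_bytes(i, j):
    """Last-round-key positions exposed by a fault on state byte a_{i,j}
    injected before the last MixColumn."""
    # ShiftRow rotates row i left by i, so a_{i,j} lands in column (j - i) mod 4.
    col = (j - i) % 4
    # MixColumn spreads the difference over the 4 bytes (r, col) of that column.
    # The S-boxes keep positions; the final ShiftRow moves (r, col) to
    # (r, (col - r) mod 4), where the key byte is XORed in.
    return [(r, (col - r) % 4) for r in range(4)]

for i, j in [(0, 0), (0, 1), (0, 2), (0, 3)]:
    print((i, j), "->", affected_key_bytes(i, j))
```

The four groups partition the 16 bytes of the last round key, which is why four well-placed groups of faults cover the whole key.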

Consider a fault occurring on one of the bytes $a_{0,0}$, $a_{1,1}$, $a_{2,2}$, or
$a_{3,3}$. We compute that with one pair $(C; C^*)$ about 1036 candidates for
$(K^R_{0,0}, K^R_{1,3}, K^R_{2,2}, K^R_{3,1})$ remain (see complexity analysis in
section 2). If two pairs are exploited, we are in principle left with the right
candidate only. Thus with 8 faults at carefully chosen locations, we are able to
recover the whole key.

However it is possible to do better. Suppose a fault occurs on one byte
somewhere between $\theta^{R-3}$ and $\theta^{R-2}$ (rather than between
$\theta^{R-2}$ and $\theta^{R-1}$). The corresponding difference after the
$\theta^{R-2}$ layer has 4 non-zero bytes. Each of them can be exploited as
described previously, and releases information about a different part of the
last round key. For example, a fault on $a_{0,0}$ before $\theta^{R-2}$ will
result in a non-zero difference on $a_{0,0}$, $a_{1,0}$, $a_{2,0}$, and $a_{3,0}$
after it. Thus using faults occurring somewhere between $\theta^{R-3}$ and
$\theta^{R-2}$ allows us to kill 4 birds with one stone. As a consequence, only
2 such faults are needed to retrieve the whole AES-128 key.

We implemented our attack on a PC. The results obtained matched our
estimates well.

When one fault between $\theta^{R-2}$ and $\theta^{R-1}$ was considered, the
average number of candidates obtained for 4 bytes of $K^R$ was 1046 (instead of
the expected 1036). A more surprising point (a priori) was that 2 pairs
$(C; C^*)$, both giving information on the same 4 bytes of $K^R$, allowed us to
retrieve a unique value for these bytes in only 98% of the cases; otherwise two
possible values for these 4 bytes remained (or even four, but this was very
rare). These deviations from the expected results are due to the fact that we
made very few hypotheses on the $\theta$ layer and the S-boxes in our complexity
analysis. Thus our estimates did not take into account particular features of
these components. We give a more detailed explanation for the 98% figure in
appendix A.

Using 2 faults between $\theta^{R-3}$ and $\theta^{R-2}$, the number of candidates
left for the whole key never exceeded 16, and we obtained a single candidate in
77% of the cases. The time needed to complete the attack is a few seconds. Also,
when applying the attack to 2 ciphertext pairs one of which is bad (i.e.
corresponds to a fault occurring before $\theta^{R-3}$), the set of candidates
returned by our algorithm was always empty.



4 Application to Khazad 

4.1 Brief Description of KHAZAD 

KHAZAD is a 64-bit block, 128-bit key block cipher submitted to the NESSIE
European project by P.S.L.M. Barreto and V. Rijmen [3]. It has 8 rounds, whose
structure is the one described in section 1, with exclusive or used for key
addition. Its $\gamma$ layer (identical for all rounds) is made up of the
application of 8 identical involutive 8 x 8 S-boxes. Its $\theta$ layer (also
identical for all rounds) has optimal byte branch number (i.e. 9) and is also
involutive.



4.2 Our Attack Applied to KHAZAD 

Two faults occurring between $\theta^{R-2}$ and $\theta^{R-1}$ are enough to
retrieve $K^R$ (as each fault gives information on all bytes of $K^R$; remember
that $\theta$ is optimal). However knowledge of $K^R$ is not enough to retrieve
the whole key. Thus once $K^R$ is known the last round is peeled off. Then a
fault occurring between $\theta^{R-3}$ and $\theta^{R-2}$ is exploited to select
about $256^8 \cdot (8 \cdot 255^{-7}) \approx 2105$ candidates for $K^{R-1}$. We
conclude the attack by searching exhaustively among these candidates; knowledge
of $K^R$ and $K^{R-1}$ allows us to compute the main key.
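The candidate count can be checked numerically. The sketch below evaluates the expression quoted above for 8 bytes; applying the same expression with 4 bytes is an assumption made here, chosen because it reproduces the figure of 1036 candidates quoted for AES in section 3.3.

```python
def expected_candidates(n_bytes):
    """Evaluate 256^n * (n * 255^-(n-1)) for an n-byte diffusion layer."""
    return 256 ** n_bytes * n_bytes / 255 ** (n_bytes - 1)

print(round(expected_candidates(8)))   # KHAZAD: 8 bytes per block
print(round(expected_candidates(4)))   # AES: 4 bytes per column (assumed same form)
```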

Our implementation of the attack showed that using 2 right pairs $(C; C^*)$
we obtain a unique candidate for $K^R$ in about 90% of the cases (otherwise 2
candidates remain, sometimes 4). One reason for this low score happens to be
related to the choice of the S-boxes: it seems that the worse an S-box is with
respect to differential cryptanalysis, the better it resists our fault attack. As
an illustration, we applied our attack to a modified version of KHAZAD using the
AES S-boxes; then a unique candidate is obtained from 2 right pairs $(C; C^*)$
with probability 96%. Appendix A sketches an explanation for this.

Note that once again, the number of faulty ciphertexts needed to retrieve the key
is not affected by these figures; only the time complexity of the attack (which
remains small anyway) is. Also, when trying to recover $K^R$ with 2 ciphertext
pairs one of which is bad, the set of candidates returned by our algorithm was
always empty.



5 Conclusion 

The basic idea of our attack is to use the diffusion property of the last
$\theta$ layer, in order to determine whether the difference before the last
nonlinear layer $\gamma$ possibly originates in a fault or not. This provides us
with a distinguishing criterion for the last round key. The fault model used is
the most liberal and realistic one: we simply need random faults occurring on
bytes. The ability to choose the location of the fault is not important either:
of course only faults occurring at a given location (between $\theta^{R-2}$ and
$\theta^{R-1}$ in the general case, between $\theta^{R-3}$ and $\theta^{R-1}$ in
the case of AES) are exploitable, but those occurring elsewhere can be
discarded.

We give in Table 3 a summary of existing fault attacks against the AES.

Table 3. Comparison of existing fault attacks against the AES

    Ref.         Fault Model        Fault Location                                            # Faulty Encryptions
    [6]          Force 1 bit to 0   Chosen                                                    128
    [6]          Fct of impl.       Chosen                                                    256
    [11]         Switch 1 bit       Any bit of chosen bytes                                   ~ 50
    [11]         Disturb 1 byte     Anywhere among 4 bytes                                    250
    [10]         Disturb 1 byte     Anywhere between $\theta^{R-2}$ and $\theta^{R-1}$        40
    This paper   Disturb 1 byte     Anywhere between $\theta^{R-3}$ and $\theta^{R-2}$        2



Amongst these attacks, the most similar to ours is that of Dusart et al. [10].
The difference mainly lies in the way faults are exploited. [10] exploits the
particular structure of the AES S-box and MixColumn, while we do not. The
consequence is that their attack is not adaptable to other algorithms; ours can
be used to attack KHAZAD (as we showed in section 4), but also ciphers like
Serpent or Anubis. On the other hand, note that an algorithm such as SAFER++
is not directly vulnerable to our attack, due to the use of two different group
operations for key mixing.

In our attack against AES, note that while 2 well-located faults are needed
for easy retrieval of the key, a single well-located fault reduces the size of
the key space to be explored to $1046^4 \approx 2^{40}$.

As a final remark, it is amusing to note that it is the very simple and elegant
structure of SPNs that makes our attack so efficient. It is not clear
whether ciphers with a more intricate structure could be broken with so few
ciphertext pairs.

A Deeper Analysis of the AES Case

In this appendix we analyze why 2 right pairs $(C; C^*)$ and $(D; D^*)$, both
releasing information on the same 4 bytes of $K^R$, do not allow us to compute a
unique value for these 4 bytes in about 2% of the cases.

Let $\bar{K}^R := (K^R_{0,0}, K^R_{1,3}, K^R_{2,2}, K^R_{3,1})$ denote 4 bytes of
the last round key of an AES. Let $\bar{C} := (C_{0,0}, C_{1,3}, C_{2,2}, C_{3,1})$
and $\bar{C}^* := (C^*_{0,0}, C^*_{1,3}, C^*_{2,2}, C^*_{3,1})$ be a right
ciphertext and its faulty counterpart, both limited to the same 4 bytes.

It is easy to see that applying our attack to pair $(C; C^*)$ will return,
together with $\bar{K}^R$ and other candidates, $\bar{K}^R \oplus \bar{C} \oplus
\bar{C}^*$ and the 14 other candidates obtained when only some bytes of
$\bar{C} \oplus \bar{C}^*$ are XORed to $\bar{K}^R$.
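The 16 indistinguishable candidates can be illustrated with a toy Python sketch. It is not the attack itself: it uses a hypothetical random bijective S-box in place of the AES one, ignores byte positions, and only checks that XORing any subset of the ciphertext-difference bytes into the key leaves the difference before the last S-box layer unchanged.

```python
import random
from itertools import product

rng = random.Random(1)            # fixed seed so the demo is repeatable
SBOX = list(range(256))
rng.shuffle(SBOX)                 # hypothetical bijective S-box, not the AES one
INV = [0] * 256
for x, y in enumerate(SBOX):
    INV[y] = x

def pre_sbox_diff(c, c_star, key):
    """Byte differences before the last S-box layer under a key guess."""
    return tuple(INV[a ^ k] ^ INV[b ^ k] for a, b, k in zip(c, c_star, key))

key = [rng.randrange(256) for _ in range(4)]
state = [rng.randrange(256) for _ in range(4)]
faulty = [s ^ rng.randrange(1, 256) for s in state]    # all 4 bytes differ
c = [SBOX[s] ^ k for s, k in zip(state, key)]
c_star = [SBOX[s] ^ k for s, k in zip(faulty, key)]

true_diff = pre_sbox_diff(c, c_star, key)
delta = [a ^ b for a, b in zip(c, c_star)]

# XOR every subset of the bytes of delta into the key: each such guess merely
# swaps the roles of the correct and faulty byte, so the difference is unchanged.
twins = []
for mask in product((0, 1), repeat=4):
    guess = [k ^ (d if m else 0) for k, d, m in zip(key, delta, mask)]
    if pre_sbox_diff(c, c_star, guess) == true_diff:
        twins.append(guess)

print(len(twins))
```

All 16 subsets pass the check, which is why a single pair can never isolate the key on its own.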

Consider a second pair $(D; D^*)$ with $\bar{D} := (D_{0,0}, D_{1,3}, D_{2,2},
D_{3,1})$ and $\bar{D}^* := (D^*_{0,0}, D^*_{1,3}, D^*_{2,2}, D^*_{3,1})$;
$\bar{D}^*$ is the faulty counterpart of $\bar{D}$. Assume $\bar{D} \oplus
\bar{D}^*$ shares some bytes with $\bar{C} \oplus \bar{C}^*$; suppose for example
$C_{0,0} \oplus C^*_{0,0} = D_{0,0} \oplus D^*_{0,0}$. Then $\bar{K}^R \oplus
(C_{0,0} \oplus C^*_{0,0}, 0, 0, 0)$ will be returned by our attack (applied to
both $(C; C^*)$ and $(D; D^*)$) as well as $\bar{K}^R$.

As the probability of having the same value at a given position of $\bar{C}
\oplus \bar{C}^*$ and $\bar{D} \oplus \bar{D}^*$ is 1/255, the probability that we
observe the same value at at least one position is $1 - (254/255)^4 \approx
0.015$. So we have found the main reason why more than one key is returned in 2%
of the cases. Note that this phenomenon is not specific to AES; furthermore this
explanation could be generalized by referring to the XOR distribution table [4]
of the S-boxes. It appears then that, paradoxically, good S-boxes with respect to
differential cryptanalysis are also those making our fault attack the most
efficient.

Abstract. To protect a cryptographic algorithm against Differential
Power Analysis, a general method consists in masking all intermediate
data with a random value. When a cryptographic algorithm combines
boolean operations with arithmetic operations, it is then necessary to
perform conversions between boolean masking and arithmetic masking.
A very efficient method was proposed by Louis Goubin in [6] to convert
from boolean masking to arithmetic masking. However, the method in
[6] for converting from arithmetic to boolean masking is less efficient. In
some implementations, this conversion can be a bottleneck. In this paper,
we propose an improved algorithm to convert from arithmetic masking
to boolean masking. Our method can be applied to encryption schemes
such as IDEA and RC6, and hashing algorithms such as SHA-1.



1 Introduction 

The concept of Differential Power Analysis was introduced by Paul Kocher et
al. in 1998 [7,8]. It consists in extracting information about the secret key of a
cryptographic algorithm by studying the power consumption of the electronic
device during the execution of the algorithm. The attack was first described on
the DES encryption scheme, then extended to other symmetric cryptosystems
such as the AES candidates [2], and also to public-key cryptosystems [5,11].

Subsequently, some countermeasures have been developed. In [3], Chari et
al. proposed an approach which consists in splitting all the intermediate variables
into a given number of shares, so that the power leakage of an individual share
does not reveal any information to the attacker. They show that the number of
power curves needed to mount an attack grows exponentially with the number
of shares. A similar approach was also proposed by Goubin et al. in [5]. The
drawback of this approach is that it greatly increases the computation time and
the memory needed. This is a crucial issue for constrained environments such as
smart cards.

Actually, when only two shares are used, this approach consists in masking
all intermediate data with a random value. This technique was evaluated by
Messerges in [10] for the five remaining AES candidates. For algorithms that
combine boolean and arithmetic operations, two different kinds of masking must be
used: boolean masking and arithmetic masking. This is typically the case for
encryption schemes such as IDEA [9] and RC6 [12], and hashing algorithms such as
SHA-1 [13]. It is therefore necessary to perform conversions between boolean
masking and arithmetic masking. The conversion itself must also be resistant
against Differential Power Analysis. Messerges proposed in [10] an algorithm
for converting from boolean masking to arithmetic masking and conversely.
However, it was shown in [4] that both conversions were vulnerable to a more
sophisticated Differential Power Analysis.

A new conversion algorithm was proposed by Goubin in [6]. In both
directions, the conversion algorithm is such that every intermediate variable is
randomly distributed; therefore, the conversion is provably resistant to first
order DPA, in which no attempt is made to correlate the power consumption at
different execution times. Moreover, the conversion from boolean masking to
arithmetic masking is very efficient. However, the conversion from arithmetic
masking to boolean masking is less efficient, as it requires a number of
operations linear in the bit-size of the data to be masked. This conversion can
be a bottleneck in some implementations. In this paper, we propose a secure and
efficient method to convert from arithmetic masking to boolean masking.

2 Definitions 

2.1 Boolean Masking and Arithmetic Masking 

In this section we recall some basic definitions. We assume that the size of all
intermediate variables is $k$ bits. A typical value for $k$ is 32 bits, as for
SHA-1 and MD5. The masking technique introduced in [3] consists in splitting
into shares each intermediate variable that appears in the cryptographic
algorithm. Then, an attacker must analyze multiple point distributions, which
requires a number of power curves exponential in the number of shares. As in
[10], we apply this technique with two shares. For algorithms that combine
boolean and arithmetic functions, two different kinds of masking have to be used:

Definition 1. We say that a data $x$ has a boolean masking when $x$ is written
as $x = x' \oplus r$ where $r$ is uniformly distributed.

For example, assume that given $x_1$, $x_2$, we must compute $x_3 = x_1 \oplus
x_2$ in a secure way. Then from the masked values $x'_1$ and $x'_2$, such that
$x_1 = x'_1 \oplus r_1$ and $x_2 = x'_2 \oplus r_2$, we compute the two shares
$x'_3 = x'_1 \oplus x'_2$ and $r_3 = r_1 \oplus r_2$, so that $x_3 = x'_3 \oplus
r_3$. Each intermediate variable is then uniformly distributed, and the
procedure is resistant against first order DPA.

Definition 2. We say that a data $x$ has an arithmetic masking when $x$ is written
as $x = A + r \bmod 2^k$ where $k$ is the size of the register and $r$ is
uniformly distributed.

For example, assume that given $x_1$, $x_2$, we must compute $x_3 = x_1 + x_2$ in
a secure way. Then from the masked values $x'_1$ and $x'_2$, such that $x_1 =
x'_1 + r_1$ and $x_2 = x'_2 + r_2$, we compute the two shares $x'_3 = x'_1 + x'_2$
and $r_3 = r_1 + r_2$, so that $x_3 = x'_3 + r_3$.
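Both definitions can be exercised with a small Python sketch. It only illustrates the arithmetic on the two shares, assuming k = 32; the function names are illustrative, and Python itself offers no protection against power analysis.

```python
import secrets

K = 32
MASK = (1 << K) - 1

def boolean_mask(x):
    """Return (x', r) with x = x' XOR r."""
    r = secrets.randbits(K)
    return x ^ r, r

def arithmetic_mask(x):
    """Return (A, r) with x = A + r mod 2^K."""
    r = secrets.randbits(K)
    return (x - r) & MASK, r

def masked_xor(share1, share2):
    """XOR two boolean-masked values share by share."""
    (x1m, r1), (x2m, r2) = share1, share2
    return x1m ^ x2m, r1 ^ r2

def masked_add(share1, share2):
    """Add two arithmetic-masked values share by share, mod 2^K."""
    (a1, r1), (a2, r2) = share1, share2
    return (a1 + a2) & MASK, (r1 + r2) & MASK

x1, x2 = 0x12345678, 0x9ABCDEF0
x3m, r3 = masked_xor(boolean_mask(x1), boolean_mask(x2))
print(hex(x3m ^ r3), hex(x1 ^ x2))          # unmasking recovers x1 XOR x2
a3, s3 = masked_add(arithmetic_mask(x1), arithmetic_mask(x2))
print(hex((a3 + s3) & MASK), hex((x1 + x2) & MASK))
```

Neither operation ever recombines a value with its own mask, so the unmasked x1, x2 and x3 never appear as intermediate values.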

For algorithms that combine boolean operations and arithmetic operations,
it is therefore necessary to provide secure conversion algorithms in both
directions.

2.2 From Boolean Masking to Arithmetic Masking

A very efficient method for converting from boolean masking to arithmetic
masking is given in [6]. It requires a number of elementary operations which is
independent of $k$, the data bit-size. The method is based on the fact that for
all $x'$ the function

$$f_{x'}(r) = (x' \oplus r) - r$$

is affine in $r$, which means that for all $x', r_1, r_2$,

$$f_{x'}(r_1 \oplus r_2) = f_{x'}(r_1) \oplus f_{x'}(r_2) \oplus x'$$

Therefore, given $x', r$ such that $x = x' \oplus r$, we generate a random
$k$-bit integer $r_1$, and we can compute $A = x - r \bmod 2^k$ as:

$$A = f_{x'}(r) = f_{x'}((r_1 \oplus r) \oplus r_1) = f_{x'}(r_1 \oplus r) \oplus (f_{x'}(r_1) \oplus x')$$

Since $r_1$ and $r_1 \oplus r$ have the uniform distribution, the conversion is
resistant against DPA. We refer to [6] for the proof that $f$ is affine and for a
detailed description of the algorithm.
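The identity above translates directly into code. The following Python sketch follows the formula term by term; it is an illustration of the arithmetic rather than the exact operation sequence of [6], and the function names are illustrative.

```python
import secrets

def f(xm, r, k):
    """f_{x'}(r) = (x' XOR r) - r mod 2^k."""
    return ((xm ^ r) - r) & ((1 << k) - 1)

def boolean_to_arithmetic(xm, r, k=32):
    """Given x = x' XOR r, return A such that x = A + r mod 2^k."""
    r1 = secrets.randbits(k)          # fresh random mask
    t = f(xm, r1 ^ r, k)              # f_{x'}(r1 XOR r)
    u = f(xm, r1, k) ^ xm             # f_{x'}(r1) XOR x'
    return t ^ u                      # equals f_{x'}(r) = x - r mod 2^k

x, r = 0xCAFEBABE, secrets.randbits(32)
A = boolean_to_arithmetic(x ^ r, r)
print(hex((A + r) & 0xFFFFFFFF))      # recovers x
```

The number of operations does not depend on k, and `f` is only ever evaluated at `r1 ^ r` and `r1`, never at `r` itself.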

2.3 From Arithmetic to Boolean Masking

Louis Goubin proposed in [6] a method for converting from arithmetic to boolean
masking, but the method is less efficient than from boolean to arithmetic. In
particular, it requires a number of operations linear in the size of the
registers; namely, for a $k$-bit register, the number of $k$-bit operations is
$5k + 5$.

3 Our Conversion Algorithm 

We propose a new conversion algorithm from arithmetic to boolean masking 
which is generally more efficient than Goubin’s method. Our method is based 
on pre-computed tables. First, we describe our method for small register size k 
(typically, k = 4). 



3.1 Conversion with Small Register Size 

The algorithm uses a pre-computed table $G$ of $2^k$ variables of $k$ bits.

Algorithm 1: table G generation.

Output: a table $G$ and a random $r$.





J.-S. Coron and A. Tchulkine 



1. Generate a random $k$-bit $r$.

2. For $A = 0$ to $2^k - 1$ do
   $G[A] \leftarrow (A + r) \oplus r$

3. Output $G$ and $r$.

Using this table, it is easy to convert from arithmetic to boolean masking:

Algorithm 2: conversion from arithmetic to boolean masking.

Input: $(A, r)$, such that $x = A + r$.

Output: $(x', r)$, such that $x = x' \oplus r$.

1. Return $x' = G[A]$.
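Algorithms 1 and 2 fit in a few lines of Python. This is a sketch with illustrative function names; the addition is reduced mod 2^k, which the k-bit register size implies.

```python
import secrets

def generate_table_g(k):
    """Algorithm 1: build G[A] = (A + r) XOR r for every k-bit value A."""
    mask = (1 << k) - 1
    r = secrets.randbits(k)
    table = [((a + r) & mask) ^ r for a in range(1 << k)]
    return table, r

def arithmetic_to_boolean_small(a, table):
    """Algorithm 2: the conversion is a single table lookup."""
    return table[a]

k = 4
G, r = generate_table_g(k)
x = 0b1011
A = (x - r) & 0xF                      # arithmetic share: x = A + r mod 2^k
x_masked = arithmetic_to_boolean_small(A, G)
print(bin(x_masked ^ r))               # recovers x
```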

It is clear that the algorithm is resistant to first-order DPA, as all
intermediate variables have the uniform distribution. In the following table, we
compare our algorithm with Goubin's algorithm. The pre-computation time and
conversion time are measured in number of $k$-bit operations.

                            Algorithm 2    Goubin's method
    Pre-computation time    2^{k+1}        0
    Conversion time         1              5k + 5
    Table size              2^k            0



The pre-computation time and memory required are the main limitation of
algorithm 2, which is only feasible for conversions with small sizes, for
example $k = 4$ or $k = 8$ bits. However, the table has to be computed only once
for each new execution of the cryptographic algorithm; any subsequent conversion
will require only one operation, instead of $5k + 5$ for Goubin's method.
Therefore, algorithm 2 will be more efficient when the number $n$ of conversions
during the execution of a cryptographic algorithm is greater than:

$$\frac{2^{k+1}}{5k + 4}$$

In this case, our method will be faster by a factor:

$$\frac{n \cdot (5k + 5)}{2^{k+1} + n}$$

For example, with $k = 8$ bits variable size, and $n = 24$ conversions, algorithm 2
is roughly two times faster than Goubin's method.
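The trade-off can be evaluated with a short Python sketch built on the operation counts given above (2^{k+1} operations of pre-computation plus one per conversion, against 5k + 5 per conversion for Goubin's method). The helper names are illustrative.

```python
from fractions import Fraction

def table_cost(k, n):
    """Pre-computation of the table plus one lookup per conversion."""
    return 2 ** (k + 1) + n

def goubin_cost(k, n):
    """5k + 5 operations per conversion, no pre-computation."""
    return n * (5 * k + 5)

def speedup(k, n):
    """Ratio of Goubin's cost to the table-based cost."""
    return Fraction(goubin_cost(k, n), table_cost(k, n))

def break_even(k):
    """Smallest number of conversions for which the table method is cheaper."""
    n = 1
    while table_cost(k, n) >= goubin_cost(k, n):
        n += 1
    return n

print(float(speedup(8, 24)))   # about 2.01
print(break_even(8))
```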

3.2 Conversion for $\ell \cdot k$-Bit Variables Using Two $k$-Bit Tables

In this section, we show how to extend the previous algorithm in order to perform
conversions for larger sizes. We consider variables of size $\ell \cdot k$ bits,
and we use 2 tables with $2^k$ variables each. For example, for 32-bit
conversions, we can take $\ell = 8$ and $k = 4$.

The idea of the algorithm is the following. We receive as input two
$\ell \cdot k$-bit variables $A$ and $R$, such that $x = A + R \bmod 2^{\ell k}$.
Our goal is to obtain $x'$ such that $x = x' \oplus R$, in such a way that every
intermediate variable has the uniform distribution. Let us split $R$ into
$R_1 \| R_2$, with $R_1$ of size $(\ell - 1) \cdot k$ bits, and $R_2$ of size $k$
bits. Then, given a random $k$-bit integer $r$, we let

$$A \leftarrow (A - r) + R_2 \bmod 2^{\ell k}$$

Splitting $A$ into $A_1 \| A_2$, where $A_1$ is of size $(\ell - 1) \cdot k$ bits,
we now have:

$$x = (A_1 \| A_2) + (R_1 \| r) \bmod 2^{\ell k}$$

Then, if $A_2 + r \geq 2^k$, we let $A_1 \leftarrow A_1 + 1 \bmod 2^{(\ell-1)k}$.
This is equivalent to computing the carry from the addition $A_2 + r$ and then
adding this carry to $A_1$. Then, splitting $x$ into $x_1 \| x_2$, where $x_1$ is
of size $(\ell - 1) \cdot k$ bits, we have:

$$x_1 = A_1 + R_1 \bmod 2^{(\ell-1)k} \quad \text{and} \quad x_2 = A_2 + r \bmod 2^k$$

Then we can use the table $G$ generated by algorithm 1 to convert $x_2$ from
arithmetic masking to boolean masking. More precisely, we let
$x'_2 \leftarrow G[A_2]$, which gives:

$$x_2 = x'_2 \oplus r$$

Then we let $x'_2 \leftarrow (x'_2 \oplus R_2) \oplus r$ so that:

$$x_2 = x'_2 \oplus R_2$$

Then we apply the same method recursively to $(A_1, R_1)$ in order to obtain
$x'_1$ such that $x_1 = x'_1 \oplus R_1$, so that letting $x' = x'_1 \| x'_2$ we
have:

$$x = x' \oplus R$$

as required.

Actually, we cannot compute the carry from $A_2 + r$ directly, because this
would leak some information about $x$. Instead, we use a randomized carry table
$C$, computed in the following way:



Algorithm 3: carry table C generation.

Input: a random $r$ of $k$ bits.

Output: a table $C$ and a random $\gamma$ of $k$ bits.

1. Generate a random $k$-bit $\gamma$.

2. For $A = 0$ to $2^k - 1$ do
   $C[A] \leftarrow \gamma$ if $A + r < 2^k$,
   $C[A] \leftarrow \gamma + 1 \bmod 2^k$ if $A + r \geq 2^k$.

3. Output $C$ and $\gamma$.

Then, instead of testing if $A_2 + r \geq 2^k$, we let:

$$A_1 \leftarrow A_1 + C[A_2] - \gamma \bmod 2^{(\ell-1)k}$$

This gives the following conversion algorithm, based on the pre-computed
tables $G$ and $C$ of algorithms 1 and 3:

Algorithm 4: conversion with $\ell \cdot k$-bit variables.

Input: $(A, R)$, such that $x = A + R$, and $r$, $\gamma$ generated from
algorithms 1 and 3.

Output: $x'$, such that $x = x' \oplus R$.

1. $A \leftarrow A - r \bmod 2^{\ell k}$.

2. Let $R = R_1 \| R_2$, where $R_1$ is of size $(\ell - 1)k$ bits.

3. Let $A \leftarrow A + R_2 \bmod 2^{\ell k}$.

4. If $\ell = 1$, then let $x' \leftarrow G[A] \oplus R_2$, then
   $x' \leftarrow x' \oplus r$ and return $x'$.

5. Otherwise, let $A = A_1 \| A_2$.

6. Let $A_1 \leftarrow A_1 + C[A_2] \bmod 2^{(\ell-1)k}$.

7. Let $A_1 \leftarrow A_1 - \gamma \bmod 2^{(\ell-1)k}$.

8. Let $x'_2 \leftarrow G[A_2] \oplus R_2$.

9. Let $x'_2 \leftarrow x'_2 \oplus r$.

10. Apply algorithm 4 recursively with $(A_1, R_1)$ to obtain $x'_1$.

11. Return $x' = x'_1 \| x'_2$.
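Algorithms 1, 3 and 4 can be sketched together in Python. One deliberate deviation is an assumption of this sketch, not something stated in the text: γ is drawn with (ℓ - 1)·k bits instead of k bits. With a k-bit γ equal to 2^k - 1, the entry γ + 1 mod 2^k wraps to 0 and C[A_2] - γ no longer equals the carry once A_1 is wider than k bits; a wider γ avoids that corner case at every recursion level. Function names are illustrative.

```python
import secrets

def make_tables(k, width):
    """Algorithms 1 and 3 for k-bit digits of a `width`-bit word.

    Assumption of this sketch: gamma has (width - k) bits, the size of the
    widest A1, so that C[A2] - gamma is exactly the carry at every level.
    """
    kmask = (1 << k) - 1
    gbits = max(width - k, 1)
    gmask = (1 << gbits) - 1
    r = secrets.randbits(k)
    gamma = secrets.randbits(gbits)
    G = [((a + r) & kmask) ^ r for a in range(1 << k)]
    # (a + r) >> k is the carry bit of the k-bit addition a + r.
    C = [(gamma + ((a + r) >> k)) & gmask for a in range(1 << k)]
    return G, C, r, gamma

def arithmetic_to_boolean(A, R, k, ell, G, C, r, gamma):
    """Algorithm 4: from x = A + R mod 2^(ell*k) to x' with x = x' XOR R."""
    full = (1 << (ell * k)) - 1
    kmask = (1 << k) - 1
    A = (A - r) & full                        # step 1
    R1, R2 = R >> k, R & kmask                # step 2
    A = (A + R2) & full                       # step 3
    if ell == 1:                              # step 4
        return (G[A] ^ R2) ^ r
    A1, A2 = A >> k, A & kmask                # step 5
    high = (1 << ((ell - 1) * k)) - 1
    A1 = (A1 + C[A2]) & high                  # step 6
    A1 = (A1 - gamma) & high                  # step 7
    x2 = (G[A2] ^ R2) ^ r                     # steps 8 and 9
    x1 = arithmetic_to_boolean(A1, R1, k, ell - 1, G, C, r, gamma)  # step 10
    return (x1 << k) | x2                     # step 11

k, ell = 4, 8
G, C, r, gamma = make_tables(k, k * ell)
x, R = 0xDEADBEEF, secrets.randbits(32)
A = (x - R) & 0xFFFFFFFF
print(hex(arithmetic_to_boolean(A, R, k, ell, G, C, r, gamma) ^ R))
```

The same pair of 16-entry tables is reused at every recursion level, so a 32-bit conversion costs eight rounds of lookups and small additions.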

As previously, this conversion method is resistant to first-order DPA, because
all intermediate variables have the uniform distribution. We want to compare
the efficiency of our method with Goubin's method. The drawback of our method
is that we need to pre-compute two tables of $2^k$ values. The advantage of our
method is that some computation is done on small $k$-bit variables, whereas
Goubin's method always works with full $\ell \cdot k$-bit variables. Therefore,
we must take into account the register size of the microprocessor. Our method is
likely to be more advantageous on an 8-bit microprocessor, which is now the most
common smart-card platform, than on a 32-bit microprocessor.

To make a practical comparison, we take $k = 4$, and we distinguish two kinds
of microprocessor: 8-bit and 32-bit, and two variable sizes: 8 bits and 32 bits.
We take $k = 4$ because the method is easier to implement for this value of $k$,
but a better trade-off may be possible. We assume that an elementary operation
on a 32-bit variable requires 4 elementary operations on an 8-bit microprocessor.
For example, Goubin's method on 32-bit variables on an 8-bit microprocessor
will require $4 \cdot (5 \cdot 32 + 5) = 660$ operations. More generally, we
denote by $T_{i,j}$ (resp. $G_{i,j}$) our method (resp. Goubin's method) for
$i$-bit variables with a $j$-bit microprocessor. The following table summarizes
the number of steps in all possible cases:





                            T_{8,8}  T_{8,32}  T_{32,8}  T_{32,32}  G_{8,8}  G_{8,32}  G_{32,8}  G_{32,32}
    Pre-computation time    64       64        64        64         0        0         0         0
    Conversion time         10       10        76        40         45       45        660       165
    Table size              32       32        32        32         0        0         0         0



As previously, the efficiency improvement depends on how frequently we
re-compute the randomized tables. If we compute the randomized tables only once
at the beginning of the cryptographic algorithm, then our method will always
be more efficient if there are at least two subsequent conversions. But if we
choose to re-compute the tables before each conversion, then Goubin's method
is more efficient for 8-bit variables, whereas our method is more efficient for
32-bit variables. Our method is particularly advantageous for 32-bit conversions
on an 8-bit microprocessor: our method (64 + 76 operations) is then 4.7 times
faster than Goubin's method (660 operations).

4 Application to SHA-1 

4.1 Overview of SHA-1 

SHA-1 is a hash function introduced by the American National Institute of
Standards and Technology [13] in 1995. The description of SHA-1 consists of a
general iteration procedure based on a compression function

$$F : \{0,1\}^{512} \times \{0,1\}^{160} \rightarrow \{0,1\}^{160}$$

In the following we give a very general overview of the algorithm (see [13] for
details).

General Iteration Procedure: 

1. Pad the message, so that its length is a multiple of the size of the compression 
function, that is 512 bits. 

2. Initialize the five 32-bit chaining variables A,B,C,D,E with a given IV 
value. 

3. For each message block $M$ of 512 bits, let

   $(A, B, C, D, E) \leftarrow F(M, (A, B, C, D, E)) + (A, B, C, D, E)$

where $F$ is the compression function.

4. Output the hash value $A \| B \| C \| D \| E$.

Compression Function F: 

1. Expand the 512-bit message block $M$ into 80 words $M_i$ of 32 bits.

2. For $i = 0$ to 79 do:

   $(A, B, C, D, E) \leftarrow (M_i + rot_5(A) + f_i(B, C, D) + E + K_i,\ A,\ rot_{30}(B),\ C,\ D)$

where $rot_j$ denotes left rotation by $j$ bits, $K_i$ are constants and:

   $f_i(X, Y, Z) = (X \& Y) | (\neg X \& Z)$,             $0 \leq i \leq 19$

   $f_i(X, Y, Z) = X \oplus Y \oplus Z$,                  $20 \leq i \leq 39$, $60 \leq i \leq 79$

   $f_i(X, Y, Z) = (X \& Y) | (X \& Z) | (Y \& Z)$,       $40 \leq i \leq 59$
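The iteration and compression function can be written out as an unmasked Python sketch. The message expansion, the round constants, the initial value and the padding are not detailed above; they are taken here from the SHA-1 standard [13]. The final line compares the result with Python's `hashlib` as a self-check.

```python
import hashlib
import struct

MASK32 = 0xFFFFFFFF

def rot(x, j):
    """Left rotation of a 32-bit word by j bits."""
    return ((x << j) | (x >> (32 - j))) & MASK32

def f(i, X, Y, Z):
    """Round-dependent boolean function f_i."""
    if i < 20:
        return (X & Y) | (~X & Z & MASK32)
    if 40 <= i < 60:
        return (X & Y) | (X & Z) | (Y & Z)
    return X ^ Y ^ Z

K_CONST = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)

def compress(block, state):
    """Compression function F on one 64-byte block."""
    M = list(struct.unpack(">16I", block))
    for i in range(16, 80):                       # message expansion from [13]
        M.append(rot(M[i - 3] ^ M[i - 8] ^ M[i - 14] ^ M[i - 16], 1))
    A, B, C, D, E = state
    for i in range(80):                           # additions are mod 2^32
        A, B, C, D, E = (
            (M[i] + rot(A, 5) + f(i, B, C, D) + E + K_CONST[i // 20]) & MASK32,
            A, rot(B, 30), C, D)
    return A, B, C, D, E

def sha1(message):
    """General iteration procedure: pad, then chain F over 512-bit blocks."""
    state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", 8 * len(message))
    for off in range(0, len(padded), 64):
        out = compress(padded[off:off + 64], state)
        state = tuple((s + o) & MASK32 for s, o in zip(state, out))
    return struct.pack(">5I", *state)

assert sha1(b"abc") == hashlib.sha1(b"abc").digest()
```

Each step mixes the boolean functions f_i and rotations with additions mod 2^32, which is where the two kinds of masking meet.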

We see that SHA-1 combines boolean operations with arithmetic operations.

4.2 Motivation

The SHA-1 hash function can be used for MAC algorithms, for example:

$$\mathrm{MAC}_K(x) = \mathrm{SHA\text{-}1}(K_1 \| x \| K_2)$$

or for the HMAC [1] nested construction:

$$\mathrm{HMAC}_K(x) = \mathrm{SHA\text{-}1}(K_2 \| \mathrm{SHA\text{-}1}(x \| K_1))$$

where $K = K_1 \| K_2$ is a secret key. In this case, the implementation of SHA-1
has to be made resistant against DPA, otherwise a straightforward DPA attack
would recover the secret key $K$.



4.3 Implementation Result 

In the following, we estimate the number of elementary operations which are
required to have an implementation of SHA-1 resistant against DPA. Without
DPA countermeasure, each of the 80 steps in the compression function requires
roughly 15 elementary 32-bit operations. The DPA countermeasure requires
splitting each variable into 2 shares; this leads to 30 elementary operations.
Moreover, assuming that $A$, $B$, $C$, $D$ and $E$ initially have a boolean
masking, we need to convert $f_i(B, C, D)$, $rot_5(A)$ and $E$ into arithmetic
masking, then the sum $M_i + rot_5(A) + f_i(B, C, D) + E + K_i$ back to boolean
masking. This gives 3 boolean to arithmetic conversions, each requiring 7
operations using [6], and one arithmetic to boolean conversion. Therefore, each
step requires $30 + 3 \cdot 7 = 51$ elementary operations on 32-bit variables (or
204 operations on 8-bit variables)¹, together with one arithmetic to boolean
conversion.

In the following table, we compare the efficiency of an implementation
of SHA-1 resistant against DPA, using our arithmetic to boolean conversion
method, and using Goubin's method, for 8-bit and 32-bit microprocessors. The
time is measured in number of elementary operations for each of the 80 steps
of the compression function. For our arithmetic to boolean conversion, we
re-compute the randomized tables before each new conversion. This means that
using our method, a 32-bit arithmetic to boolean conversion takes 140 elementary
operations on an 8-bit microprocessor, and 104 operations on a 32-bit
microprocessor.

                       8-bit micro    32-bit micro
    Our method         344            155
    Goubin's method    864            216
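The per-step totals in the table can be rebuilt from the figures quoted in this section and in section 3.2. A small Python sketch, with illustrative names:

```python
# 30 operations for the two-share step, plus three boolean-to-arithmetic
# conversions at 7 operations each, counted in 32-bit operations.
BASE_OPS_32 = 30 + 3 * 7

# On an 8-bit microprocessor, one 32-bit operation costs 4 elementary ones.
WORD_FACTOR = {8: 4, 32: 1}

# Cost of one 32-bit arithmetic-to-boolean conversion, tables recomputed
# before each conversion for "ours" (pre-computation + conversion time).
A2B_COST = {
    ("ours", 8): 64 + 76,
    ("ours", 32): 64 + 40,
    ("goubin", 8): 660,
    ("goubin", 32): 165,
}

def step_cost(method, cpu_bits):
    """Elementary operations for one of the 80 masked SHA-1 steps."""
    return BASE_OPS_32 * WORD_FACTOR[cpu_bits] + A2B_COST[(method, cpu_bits)]

for method in ("ours", "goubin"):
    print(method, step_cost(method, 8), step_cost(method, 32))
```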



^ As previously, we assume that a 32-bit operation on a 8-bit micro-processor requires 
4 elementary operations. 
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5 Conclusion 

We have described a new conversion algorithm from arithmetic to boolean 
masking, which is generally more efficient than Goubin’s algorithm. Our new 
algorithm is particularly interesting for 32-bit conversions on an 8-bit microprocessor. 
For example, for the SHA-1 hash function, the previous table shows that an 
implementation secure against DPA will be roughly 2.7 times faster using our method 
than using Goubin’s method. 

Abstract. A new general method for designing key-dependent reversible 
circuits is proposed and concrete examples are included. The 
method is suitable for data scrambling of internal links and memories on 
smart card chips in order to foil the probing attacks. It also presents a 
new paradigm for designing block ciphers suitable for small-size and/or 
high-speed hardware implementations. In particular, a concrete building 
block for such block ciphers with a masking countermeasure against 
power analysis incorporated on the logical gate level is provided. 

Keywords. Keyed reversible circuits, data scrambling, block ciphers, 
countermeasures, probing attacks, power analysis. 



1 Introduction 

Probing attacks on microelectronic data-processing devices implementing cryp- 
tographic functions, such as smart cards, are invasive techniques consisting in 
introducing conductor microprobes into certain points of a tamper-resistant chip 
within the device to monitor and analyze the electrical signals at these points, in 
order to recover some information about the secret key used (see [1]). They can 
be classified as side-channel attacks if they do not change the functionality of the 
device. In this regard, potentially vulnerable points are those corresponding to 
internal links or memories that are likely to convey or contain secret information 
and whose hardware implementation has a regular, recognizable structure. In a 
microprocessor configuration, the RAM and the bus connecting it to the microprocessor 
are especially vulnerable, and the bus between the microprocessor and the 
cryptoprocessor(s) may also be vulnerable. While, for a cryptographic function, 
it should be computationally infeasible to reconstruct the secret key from known 
input and output data, this need not be the case if intermediate data generated 
during the software execution is revealed. Therefore, there is a need to protect 
the sensitive data on data buses and in memories by using dedicated encryption 
techniques. Apart from data, one can also encrypt the memory addresses and 
the code instructions. This encryption may also reduce the vulnerability to other 
side-channel attacks such as the power analysis attacks. 

The encryption/decryption of data solely on the data bus can be achieved by 
a fast stream cipher combining the data sequence with the keystream sequence, 
produced by a centralized random or pseudorandom number generator, possibly 
by the bitwise XOR operation. However, this solution is not satisfactory for 
encrypting the data to be stored in memories as the same keystream block has 
to be used for encrypting and decrypting any data block for a particular memory 
location. The keystream block can be made dependent on the location itself, but 
it is then immediately recovered from only one known pair of original and encrypted 
data blocks. A more satisfactory solution for encrypting the memory data, which 
can also be used for encrypting the bus data, is to apply a block cipher, in the 
electronic code book mode, and the encryption has to be performed by logical 
circuits typically in a single microprocessor clock cycle, as it is then transparent 
to the remaining components of the data-processing device. However, the usage 
of classical block ciphers such as DES [13] or AES [5] is not realistic due to very 
restrictive high-speed and small-size requirements and, also, because of small 
and variable block sizes involved. 
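
The weakness of a location-dependent keystream noted above can be illustrated with a small Python sketch. The keystream function below (a SHA-256 digest of a secret and the address) is purely hypothetical and stands in for any generator that returns the same block for the same memory location; it is not a construction taken from this paper.

```python
import hashlib
import secrets

def keystream_block(secret: bytes, address: int) -> int:
    """Hypothetical 8-bit keystream block that depends on the memory address."""
    digest = hashlib.sha256(secret + address.to_bytes(2, "big")).digest()
    return digest[0]

def scramble(secret: bytes, address: int, data: int) -> int:
    """XOR scrambling of one byte; the same call also descrambles."""
    return data ^ keystream_block(secret, address)

secret = secrets.token_bytes(16)
address = 0x01F3

# One known pair of original and scrambled data for this location ...
known_plain = 0x5A
known_cipher = scramble(secret, address, known_plain)
recovered = known_plain ^ known_cipher        # ... reveals the keystream block.

# Any later value written to the same location can now be read directly.
later_cipher = scramble(secret, address, 0xC3)
print(hex(later_cipher ^ recovered))          # 0xc3
```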

Accordingly, there is a need for fast and simple techniques for key-controlled 
reversible transformations which, of course, cannot achieve the same security 
level as classical block ciphers and are therefore called data scrambling instead 
of encryption. Nevertheless, it has to be noted that the probing attacks are not 
easy to mount and consequently partial instead of full knowledge of ciphertext 
corresponding to known, possibly chosen, plaintext is a much more appropriate 
assumption for data scrambling. To this end, one can use simplified iterated de- 
signs for block ciphers, with a reduced block size and with a reduced number 
of rounds, but the usual method of bitwise XORing the expanded secret key 
with intermediate ciphertexts is not good enough for relatively small block sizes. 
Instead, one can incorporate a larger number of key bits by using key-controlled 
bit permutations to permute bits in a block. Several constructions for such per- 
mutations are proposed in [3] for block sizes being a power of 2. Nevertheless, a 
small number of rounds and the linearity of bit permutations do not allow one 
to achieve a sufficient security level especially if the block size is very small. 

The main objective of this paper is to introduce a new and generic method 
for iterated construction of key-controlled reversible transformations which can 
incorporate a large number of key bits in a small number of rounds even for 
very small block sizes. The new, so-called DeKaRT method is based on using 
small elementary building blocks connected by fixed bit permutations, where 
all the blocks are simple and can be of the same type. The resulting hardware 
designs are granular, simple, and fast, and can be customized easily by choosing 
different building blocks or connections among them. Consequently, the method 
is very suitable for data scrambling to thwart the probing attacks, as well as for 
new designs of hardware-oriented block ciphers. Moreover, the hardware imple- 
mentations of such block ciphers can easily be made resistant to differential and 
simple power analysis attacks on the logical gate level by applying and adapting 
the masking technique [10] to the building blocks used. 

More details on data scrambling techniques together with some security requirements 
and considerations are given in Section 2. The DeKaRT method for 
designing key-dependent reversible transformations and logical circuits is pro- 
posed in Section 3 and some examples are presented in Section 4. The application 
of the DeKaRT method for the design of block ciphers is discussed in Section 5 
and the application of the masking technique against power analysis is explained 
in Section 6. Conclusions are given in Section 7. 



2 Data Scrambling 

For an n-bit microprocessor, the typical block size to be dealt with by data 
scrambling is n bits, but can be smaller. For example, apart from the data 
to be stored in RAM, the address of the location where the data has to be 
stored can be scrambled too. In addition, the scrambling function for data can 
be made dependent on the original address, for example, by deriving the key 
for this scrambling function from the same secret key and the original address 
by a simple hash function. Of course, scrambling the address foils the known 
plaintext scenario for scrambling the data, as the address of the location where 
the scrambled data is stored is not known. Not only is the address size usually 
smaller than n, but it also need not be a power of 2. Another example is scrambling 
the data and memory addresses within the microprocessor core (e.g., for the 
cache memory), where usually certain restrictions have to be respected when 
transmitting and storing the data. Accordingly, the block sizes as small as 8 bits 
or even smaller are likely to be encountered in data scrambling. For such block 
sizes, the key-controlled bit permutations cannot provide enough uncertainty of 
the key used. 

For very small block sizes, data scrambling is inherently vulnerable to the 
dictionary attack, in the known or chosen plaintext scenario. This attack recon- 
structs the secret scrambling transformation used, and not the secret key itself. 
Secret key reconstruction attacks are more important because the same secret 
key bits can be used repeatedly for scrambling different data. However, it is 
important to have in mind that in the context of probing attacks, the known 
plaintext and known ciphertext scenarios are not realistic due to the fact that 
the ciphertext is likely to be known only partially. 

If an iterated construction is used for data scrambling and if a number of 
secret key bits is incorporated in each of a small number of rounds (e.g., 2 to 5), 
then the effective key size is roughly halved due to the structural meet-in-the- 
middle attack in the known plaintext scenario. So, the number of key bits per 
round has to be relatively large, and this is not easy to achieve for very small 
block sizes. In particular, as proposed in [3], one can use a network composed of 
a small number of alternating substitution and key-controlled bit permutation 
layers, where the substitution layers consist of fixed and small (e.g., (4 x 4)-bit) 
reversible S-boxes. For 5 layers altogether, because the bit permutations are lin- 
ear functions for a given key, the structural attack [2] in the chosen plaintext 
scenario is then applicable, as already noted in [3]. The attack is capable of 
reconstructing both the S-boxes, if they are unknown to the attacker, and the bit 
permutations in all the layers, up to an equivalence of the total transformation, 
and works even for large block sizes. 

The secret key used for data scrambling should preferably be renewed for each 
new execution of the cryptographic function on the chip considered, having in 
mind that changing the key for scrambling the data in RAM can be done only 
when the stored data is all used. As a consequence, the secret key is much less 
exposed to side-channel attacks such as the power analysis attacks and hence 
the data scrambling schemes need not in principle be protected by the masking 
technique [9], [10] which randomizes (and slows down) the key-dependent com- 
putations. The secret key should be completely independent of the secret key 
stored on the chip which is used for the cryptographic function itself. It can be 
generated by a random number generator implemented in hardware on the same 
chip or, alternatively, but less securely, by a relatively strong and simple pseu- 
dorandom number generator (stream cipher), implemented in hardware, from a 
secret seed and some innovation information, which does not have to be secret. 
It is important to emphasize that the secret key used for scrambling should be 
stored in a hardware-protected register, not in RAM. 



3 DeKaRT Method 

Our strategic objective stemming from data scrambling applications described in 
previous sections is to propose a generic method for constructing key-dependent 
reversible transformations {0, 1}^N -> {0, 1}^N by logical (combinatorial) circuits 
that can incorporate a relatively large number of key bits with a relatively small 
number of logical gates arranged in a small number of levels, even if the block 
size N is very small, such as N = 8. 

In iterated constructions of block ciphers, where the block size N is at least 
64, the number of key bits per each round is typically at most N and the key 
bits are incorporated by using the bitwise XOR operation. Each round typically 
contains a layer of fixed nonlinear S-boxes, independent of the key, to provide 
confusion and may contain an extra layer of fixed affine transformations, also 
independent of the key, to provide diffusion. In a cipher like AES, the S-boxes 
have to be reversible, but act on all the input bits. If a Feistel structure is used, 
like in DES, then S-boxes do not have to be reversible, but act on a portion 
of input bits only. So, if N is small, then a large number of rounds is needed 
to incorporate a large number of key bits, and this is not acceptable for data 
scrambling as the depth of the corresponding logical circuit would be too large. 

For data scrambling purposes, the number of key bits per round can be 
increased by using key-controlled bit permutations to incorporate the key, as 
suggested in [3]. However, for small N, such as N ≤ 16, this is not sufficient. Also, 
the key-controlled bit permutations are not cryptographically strong themselves. 
For example, regardless of the key, they preserve the Hamming weight of input 
data and, as a consequence, the XOR of all the output bits is always equal to 
the XOR of all the input bits. On the other hand, the computation of S-boxes 
does not involve any key and takes a considerable number of logical levels. 
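
The invariants of bit permutations mentioned above are easy to confirm experimentally. In the Python sketch below, a randomly shuffled index list stands in for a key-selected bit permutation (an assumption made only for illustration); the Hamming weight and the XOR of all bits are compared before and after permuting.

```python
import random
from functools import reduce
from operator import xor

def permute_bits(bits, perm):
    """Output bit i is input bit perm[i]."""
    return [bits[p] for p in perm]

N = 8
rng = random.Random(2003)
for _ in range(1000):
    perm = list(range(N))
    rng.shuffle(perm)                       # stands in for a key-controlled permutation
    x = [rng.randint(0, 1) for _ in range(N)]
    y = permute_bits(x, perm)
    assert sum(x) == sum(y)                 # Hamming weight is preserved
    assert reduce(xor, x) == reduce(xor, y) # so is the XOR of all bits
print("invariants hold for all sampled permutations")
```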

Accordingly, what we essentially need is an iterated construction composed of 
a number (not too large) of layers, where each layer implements a key-dependent 
reversible transformation, has a small logical depth, and is able to incorporate 
a number of key bits larger than the block size. A small logical depth in fact 
implies that each output bit of each layer depends on a small number of input 
bits to that layer. In the construction that we propose each layer consists of 
a number of small building blocks, each block implementing a key-dependent 
reversible transformation. 

A generic building block is shown in Fig. 1. It acts on a small number of input 
data bits which are divided into two groups of m and n bits, respectively. The 
m input bits are used for control and are passed to the output intact, like in 
the Feistel structure. They are then used to select k out of k·2^m key bits by the 
multiplexer (MUX) circuit with m control bits, 2^m k-bit inputs, and the k-bit 
output vector k. The MUX circuit in fact implements an m x k lookup table, i.e., k 
(binary) m x 1 lookup tables that are specified by the key. Finally, the k bits 
are then used to select an (n x n)-bit reversible transformation R_k acting on the 
remaining n input bits to produce the corresponding n output bits. The m and 
n bits are called the control and transformed data bits, respectively. The total 
number of the key bits in the building block is thus k·2^m, which can easily be 
made larger than m + n. The set of used reversible transformations has to be 
chosen in a way that can easily be implemented by a logical circuit with n + k 
input bits and n output bits. The inverse building block is the same except that 
the reversible transformations R_k are replaced by their inverses R_k^{-1}. 
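
A minimal Python sketch of such a building block follows. The function names and the concrete family R_k used here (XOR with two key bits followed by a key-controlled swap, for n = 2 and k = 3) are assumptions chosen for illustration; the generic block allows any set of reversible transformations.

```python
import random
from itertools import product

def dekart_block(bits, key_table, m, transform):
    """Apply one building block to a list of m + n bits.

    key_table holds 2**m entries of k key bits each, i.e. k * 2**m key bits.
    """
    control, data = bits[:m], bits[m:]
    index = sum(bit << i for i, bit in enumerate(control))  # MUX address
    selected = key_table[index]                             # the k selected key bits
    return list(control) + list(transform(data, selected))

def rt_forward(data, k):
    """Assumed R_k: XOR with (k0, k1), then swap the two bits if k2 = 1."""
    a, b = data[0] ^ k[0], data[1] ^ k[1]
    return (b, a) if k[2] else (a, b)

def rt_inverse(data, k):
    """Inverse of the assumed R_k: undo the swap first, then the XOR."""
    a, b = (data[1], data[0]) if k[2] else (data[0], data[1])
    return (a ^ k[0], b ^ k[1])

m, n, k = 2, 2, 3
rng = random.Random(1)
key_table = [tuple(rng.randint(0, 1) for _ in range(k)) for _ in range(2 ** m)]

for bits in product((0, 1), repeat=m + n):
    out = dekart_block(list(bits), key_table, m, rt_forward)
    assert out[:m] == list(bits[:m])                      # control bits pass intact
    assert dekart_block(out, key_table, m, rt_inverse) == list(bits)
print("key bits in the block:", k * 2 ** m)
```

Because the control bits are unchanged, the inverse block selects the same k key bits as the forward block, which is why only the transformation itself needs to be inverted.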




Fig. 1. The generic DeKaRT building block. 



The underlying design paradigm is that a part of input data chooses a key and 
the key chooses a reversible transformation acting on the remaining part of input 
data. This justifies the name data-chooses-key-chooses-reversible-transformation 
and the notation D->K->RT for such a paradigm. For simplicity, we propose the 
name DeKaRT. 

Consequently, in each layer, N input bits are divided into small blocks and 
each of them is transformed by an elementary DeKaRT block. In a uniform 
design all the building blocks are of the same type. The layers are connected 
by fixed bit permutations satisfying the following two diffusion properties. In a 
uniform design, the bit permutations between the layers are the same. First, the 
control bits in each layer should be used as the transformed bits in the next layer. 
Therefore, the number of control bits in each layer cannot exceed the number of 
the transformed bits. In a uniform design, this implies that m ≤ n. Second, for 
each building block, both control bits and transformed bits should be extracted 
from the maximal possible number of building blocks in the preceding layer. In a 
uniform design, this number equals min(m, N/(m + n)) for the control bits and 
min(n, N/(m + n)) for the transformed bits. In the inverse DeKaRT network, the 
layers are applied in the reverse order and the inverse bit permutations are used. 
If m = n, then the used bit permutations can be made equal to their inverses. 
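
The layered construction and its inverse can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: N = 8, m = n = 2, k = 3, three layers, the index list SRC for the fixed bit permutation, and the transformation inside each block (XOR followed by a key-controlled swap). SRC is chosen so that control bits become transformed bits in the next layer, so that each block draws its control and transformed bits from min(2, 8/4) = 2 blocks of the preceding layer, and so that the permutation is its own inverse.

```python
import random

M = N_DATA = 2            # control and transformed bits per block (m = n)
K = 3                     # key bits selected in each block
BLOCK = M + N_DATA
N = 8                     # block size: N / (m + n) = 2 building blocks per layer
LAYERS = 3
SRC = [2, 6, 0, 4, 3, 7, 1, 5]   # next_input[i] = output[SRC[i]]

def block(bits, table, inverse=False):
    c0, c1, x0, x1 = bits
    k0, k1, k2 = table[c0 + 2 * c1]          # control bits choose the key bits
    if inverse:
        if k2:
            x0, x1 = x1, x0
        x0, x1 = x0 ^ k0, x1 ^ k1
    else:
        x0, x1 = x0 ^ k0, x1 ^ k1
        if k2:
            x0, x1 = x1, x0
    return [c0, c1, x0, x1]

def layer(bits, layer_key, inverse=False):
    out = []
    for j, table in enumerate(layer_key):
        out += block(bits[BLOCK * j: BLOCK * (j + 1)], table, inverse)
    return out

def permute(bits):
    return [bits[s] for s in SRC]

def scramble(bits, key):
    for i, layer_key in enumerate(key):
        bits = layer(bits, layer_key)
        if i < len(key) - 1:
            bits = permute(bits)
    return bits

def descramble(bits, key):
    for i in reversed(range(len(key))):
        if i < len(key) - 1:
            bits = permute(bits)             # SRC is an involution
        bits = layer(bits, key[i], inverse=True)
    return bits

def to_bits(value):
    return [(value >> i) & 1 for i in range(N)]

rng = random.Random(7)
key = [[[tuple(rng.randint(0, 1) for _ in range(K)) for _ in range(2 ** M)]
        for _ in range(N // BLOCK)] for _ in range(LAYERS)]

# Control positions of the next layer are fed by transformed positions only.
assert all(SRC[i] % BLOCK >= M for i in range(N) if i % BLOCK < M)
assert all(descramble(scramble(to_bits(v), key), key) == to_bits(v)
           for v in range(2 ** N))
print("key bits used:", LAYERS * (N // BLOCK) * K * 2 ** M)
```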

The definition given above is quite general, but is already sufficient for 
proposing specific constructions. Some concrete examples for the DeKaRT build- 
ing blocks are given in the next section. For cryptographic security, a number of 
desirable additional criteria are also proposed. 

– First, regarding the choice of m + n, it is prudent to require that the number 
of building blocks per layer is at least 2. 

– Second, regarding the choice of the reversible transformations R_k, it can be 
required that each output bit of R_k is a nonlinear function of input data bits 
and the key k. Moreover, it can also be required that the algebraic normal 
form of this function contains at least one binary product involving both 
input data and key bits. In this way, the transformed and control input bits 
at each layer are nonlinearly combined together. This criterion implies that 
n > 1, as the only reversible functions of one binary variable are the identity 
and the binary complement functions, so that the single key bit has to be 
XORed with the input bit to obtain the output bit. Already for n = 2 this 
criterion can be satisfied, as shown in the next section. It is not satisfied if 
k = n key bits are bitwise XORed with n input data bits, as in the usual 
Feistel structure. 

– Third, it is desirable that the set of reversible transformations R_k satisfies 
a Shannon-type criterion that the uncertainty of n input bits provided by 
purely random k key bits when the output n bits are known is maximal 
possible, that is, n bits. For this to hold, it is necessary that k ≥ n. This 
criterion can easily be satisfied by bitwise XORing a subset of n key bits 
with n input data bits. 

Cryptanalysis of the DeKaRT networks with a small number of rounds is a 
problem interesting for future investigations. To this end, the method from [4] 
developed for Feistel networks with four rounds and randomly chosen round 
functions may be relevant. 

4 Examples 

In order to specify a concrete DeKaRT building block, one has to choose the 
parameters m, n, and k and to propose a logical circuit to implement the 
parametrized reversible transformation R_k as a function of n input data bits 
and k key bits. This logical circuit should be relatively simple in terms of size 
and depth. 

A simple and sufficient method for designing such logical circuits is to use 
XORs with 2 input bits and (controlled) SWITCHes only, where a SWITCH 
has 2 input bits, 2 output bits, and 1 control bit that determines if the input 
bits are swapped or not. Clearly, a SWITCH can be implemented by using two 
MUXes in parallel, whereas only one MUX suffices for implementing each XOR. 
Here and throughout, unless specified differently, a MUX has 2 input bits, 1 
control bit, and 1 output bit. For each XOR, one of the two input bits is a key 
bit, whereas for each SWITCH, the control bit is a key bit. The individual key 
bits are incorporated into the circuit in such a way that there are no equivalent 
keys, i.e., that different combinations of key bits give rise to different reversible 
transformations. This is not a problem for checking since the parameters n and k 
are small. For each fixed key, such reversible transformations are affine, and the 
nonlinearity is achieved by the selected key bits depending on the control input 
data bits. Note that for n = 2, all 24 reversible transformations of 2 input bits 
are necessarily affine. The Shannon-type criterion is not satisfied if the circuit 
contains the key-controlled SWITCHes only. The resulting DeKaRT building 
blocks thus incorporate, extend, and generalize, on an atomic level due to small 
block sizes, elements of known block cipher design principles such as Feistel 
structures [13], data-dependent bit permutations [14], [11], and key-dependent 
bit permutations [3]. 
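
Several of the claims in this paragraph can be checked by exhaustive enumeration for n = 2. The Python sketch below is only an illustration under stated assumptions: the two transformation families (XOR followed by a SWITCH, and a SWITCH alone) and all function names are ours, and the Shannon-type criterion is tested as the requirement that, for every output, a uniformly random key leaves the input uniformly distributed over all four values.

```python
from collections import Counter
from itertools import permutations, product

def is_affine(table):
    """True if x -> table[x] on 2-bit values is affine over GF(2)."""
    g = [y ^ table[0] for y in table]            # remove the constant term
    return all(g[a ^ b] == g[a] ^ g[b] for a in range(4) for b in range(4))

def xor_switch(x, key):
    """Assumed family with k = 3: XOR with (k0, k1), then swap if k2 = 1."""
    x0, x1 = (x & 1) ^ key[0], (x >> 1) ^ key[1]
    if key[2]:
        x0, x1 = x1, x0
    return x0 | (x1 << 1)

def switch_only(x, key):
    """Assumed family with k = 1: a single key-controlled SWITCH."""
    x0, x1 = x & 1, x >> 1
    if key[0]:
        x0, x1 = x1, x0
    return x0 | (x1 << 1)

def shannon_ok(transform, k):
    """Check that a uniform key makes the input uniform given any output."""
    for y in range(4):
        counts = Counter(x for key in product((0, 1), repeat=k)
                         for x in range(4) if transform(x, key) == y)
        if len(counts) < 4 or len(set(counts.values())) != 1:
            return False
    return True

all_maps = list(permutations(range(4)))
print(len(all_maps), all(is_affine(t) for t in all_maps))

# No equivalent keys: the 8 keys of xor_switch give 8 distinct transformations.
tables = {tuple(xor_switch(x, key) for x in range(4))
          for key in product((0, 1), repeat=3)}
print(len(tables), shannon_ok(xor_switch, 3), shannon_ok(switch_only, 1))
```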

The basic concrete example satisfying the desirable properties from Section 
3 with parameters (m, n, k) = (2, 2, 3) is shown in Fig. 2. Many other examples 
can be obtained similarly. First, by removing the 2 XORs and 1 control input 
we get a DeKaRT block with parameters (1,2, 1), which will be called the sim- 
plified block from Fig. 2. Second, by removing 1 control input we get a block 
with parameters (1, 2, 3). Third, by removing the SWITCH we get a block with 
parameters (2,2,2). Fourth, a very elementary block with parameters (1,1,1) 
contains only 1 MUX with 2 input key bits and 1 XOR for the reversible trans- 
formation. The DeKaRT block from Fig. 2 can be implemented by using a circuit 
of 13 MUXes with depth 4, whereas a circuit implementing the corresponding 
simplified block has size 3 and depth 2, also in terms of MUXes. The two blocks 
incorporate a total of 12 and 2 key bits, respectively. 
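
The size, depth, and key-bit figures quoted for the two blocks can be reproduced with a simple count. The Python sketch below rests on assumptions of ours: each selected key bit is produced by a binary tree of 2-input MUXes (2^m - 1 MUXes, depth m), an XOR costs one MUX, a SWITCH costs two MUXes in parallel, and the XORs precede the SWITCH.

```python
def block_cost(m, k, xors, switches):
    """Return (MUX count, depth, key bits) for one building block.

    Assumes a binary MUX tree per selected key bit and at most one level
    of XORs followed by at most one level of SWITCHes.
    """
    selector = k * (2 ** m - 1)               # k trees, each with 2**m - 1 MUXes
    size = selector + xors + 2 * switches
    depth = m + (1 if xors else 0) + (1 if switches else 0)
    key_bits = k * 2 ** m
    return size, depth, key_bits

print(block_cost(2, 3, xors=2, switches=1))   # block from Fig. 2
print(block_cost(1, 1, xors=0, switches=1))   # simplified block
```

Under these assumptions the counts agree with the values stated in the text: 13 MUXes, depth 4, and 12 key bits for the block from Fig. 2, and 3 MUXes, depth 2, and 2 key bits for the simplified block.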

The blocks can readily be used for defining concrete data scrambling func- 
tions of the DeKaRT type. For example, for N = 16, in the uniform DeKaRT 
network based on the block from Fig. 2, each layer contains 4 such blocks and 
hence has a total of 4 x 13 = 52 MUXes, has depth 4, and incorporates 48 key bits. 
Accordingly, five layers like this incorporate 240 key bits and can be implemented 
by a circuit with 260 MUXes and depth 20. Similarly, for N = 15, in the uniform 
DeKaRT network based on the simplified block from Fig. 2, each layer contains 
5 such blocks and hence has a total of 15 MUXes, has depth 2, and incorporates 
10 key bits. Accordingly, ten layers like this incorporate 100 key bits and 
can be implemented by a circuit with 150 MUXes and (the same) depth 20. It 
may be possible to further reduce the size and/or depth by using an optimized 
ASIC design, where the optimization also depends on the fixed bit permutations 
used between the layers. Both the networks incorporate a relatively large 
number of key bits and have a very small size and depth, which, for a relatively 
small N such as N ≤ 16, is impossible to achieve with the networks of S-boxes 
and key-controlled bit permutations. Also, due to the DeKaRT paradigm, their 
cryptographic security is considerably improved. 



Fig. 2. An elementary DeKaRT building block. 

In conclusion, the proposed DeKaRT method is a new and interesting tool 
to be used for data scrambling purposes. Depending on the size and depth con- 
straints stemming from particular applications, one can either use the proposed 
concrete DeKaRT designs or easily derive new customized concrete designs by 
using various DeKaRT building blocks as well as various bit permutations to be 
used between the layers. 

5 Application for Block Ciphers 

The generic or concrete DeKaRT designs described in the preceding section 
can also be used for constructing high-speed and/or small-size block ciphers 
suitable for hardware implementations. For example, they may be used for the 
(proprietary) encryption of copyright digital data to be stored in non-volatile 
memories for multimedia applications. What is needed is to use sufficiently many 
DeKaRT building blocks per layer according to the increased block size, N ≥ 64, 
and to increase the number of layers/rounds to achieve a required security level, 
which is higher than for data scrambling applications. Namely, it is required 
that it should be computationally infeasible to reconstruct the secret key faster 
than by exhaustive search, given any number of arbitrarily chosen plaintext- 
ciphertext pairs. Since the size and depth of each DeKaRT layer is considerably 
smaller than in a usual iterated construction of block ciphers, the number of 
rounds in the DeKaRT construction can be several times larger. For example, 
for the DeKaRT building block from Fig. 2 and N = 128, the number of rounds 
can be about 32 or larger. 

For cryptographic security, one should satisfy the desirable criteria from Sec- 
tion 3 as well as possibly introduce two more round keys of size N to be bitwise 
XORed with the input and output bits. Furthermore, in view of the statisti- 
cal cryptanalytic methods such as the linear cryptanalysis of block ciphers [8], 
instead of using only the bit permutations between the layers, it is prudent to 
use very simple reversible linear functions. For example, if the total numbers of 
transformed and control data bits per layer are equal, one can design the bit per- 
mutations as explained in Section 3 and then XOR every transformed data bit 
at the input to each layer with a distinct transformed data bit from the preced- 
ing layer. This usually does not increase the logical depth of the layers. For the 
DeKaRT building block from Fig. 2, a preliminary analysis shows that fewer than 
32 rounds are then sufficient to achieve resistance to the linear cryptanalysis, 
provided that the round keys are purely random and independent. 

Unlike the data scrambling functions, the encryption or decryption functions 
do not have to be performed in only one microprocessor clock cycle, so that 
they can be implemented by a combination of logical circuits and registers. For 
example, several layers at a time can be implemented by a logical circuit. Note 
that the pipelined architectures for the DeKaRT constructions are extremely 
fast due to the small depth of each layer. 

For the DeKaRT design, the required number of key bits per round is typi- 
cally larger than the block size. This is needed for data scrambling applications 
where the block size and the number of rounds are both relatively small. For example, 
the DeKaRT building block from Fig. 2 requires 3 key bits per input bit. 
Furthermore, as for block cipher applications the number of rounds is increased, 
the total number of key bits required is larger than in usual block cipher designs. 
These key bits can be produced from a smaller number of secret key bits, stored 
in RAM or in a hardware-protected register, by a key expansion algorithm. 

The key expansion algorithm can produce the round keys iteratively and 
can itself be implemented in hardware by a combination of logical circuits and 
registers, so that not all the round keys have to be stored. The relations between 
the round keys should not facilitate the secret key reconstruction attacks in 
the chosen plaintext scenario and should prevent the secret key reconstruction 
attacks in the related key scenario. The proposed DeKaRT variant of the key 
expansion algorithm is as follows. Let K and K' denote the bit sizes of the secret 
key and the round key, respectively. The K secret key bits are first expanded 
by linear transformations into K' key bits by using an appropriate linear code 
so that any subset of K'' expanded key bits are linearly independent, where 
K'' is relatively large (K'' ≤ K). In other words, the minimum distance of the 
dual of this linear code should be at least K'' + 1 (e.g., see [7]). The obtained 
expanded key is then used as an input to a DeKaRT network of block size K' 
which is parametrized by a fixed randomly generated key satisfying an additional 
condition that every MUX block in the network implements balanced binary 
lookup tables. The K' bits produced after every two layers of the DeKaRT 
network are successively used as round keys, together with the K' input bits. 
For the decryption transformation, the round keys can be produced in the reverse 
order starting from the round key for the last round, which can be precomputed 
and stored. 

As the number of layers is thus doubled when compared with the DeKaRT 
network used for the block cipher, the DeKaRT building blocks used could be as 
simple as the simplified block from Fig. 2. Alternatively, the K' round key bits 
can be produced after each layer of the DeKaRT network, if one allows portions 
of successive round keys to be bit permutations of each other. In the iterated 
DeKaRT algorithm each round is a reversible transformation, which is important 
as it satisfies the criterion that each round key is purely random if the input to 
the first round is purely random. 

The key expansion algorithm can be simplified by using only linear trans- 
formations in the following way. The K secret key bits are first expanded by 
linear transformations into 2K' key bits by using an appropriate linear code so 
that there are no small subsets of linearly dependent expanded key bits. The 
expanded 2K' bits are then used as the round keys for the first two rounds, 
whereas the subsequent pairs of successive round keys are produced by applying 
fixed bit permutations to the expanded key bits. Of course, other simplifications 
or modifications are also possible. 

6 DeKaRT Construction with Masking against Power 
Analysis 

Side-channel attacks on software or hardware implementations of various cryp- 
tosystems aim at recovering the secret key information from certain physical 
measurements performed on the electronic device during the computation such 
as the power consumption, the time, and the electromagnetic radiation. They do 
not change the functionality of the device and are typically not invasive. Power 
analysis attacks [6] are very powerful as they do not require expensive resources 
and as most (software) implementations without specific countermeasures in- 
corporated are vulnerable to such attacks. Among them, the (first-order) differ- 
ential power analysis (DPA) attacks are particularly interesting, because they 
use a relatively simple statistical technique that is almost independent of the 
implementation of the cryptographic algorithm. More sophisticated statistical 
analysis of power consumption curves may also be feasible. 

The basis of power analysis attacks on cryptographic electronic devices are 
elementary computations within the device that depend on the secret key in- 
formation and possibly on the known output and input information. If in ad- 
dition the power consumption corresponding to these elementary computations 
depends on the input data, then it is not surprising that the power consump- 
tion curves contain information about the secret key which may be feasible to 
extract by statistical techniques. Software implementations, in which the operations 
are synchronized by the microprocessor clock, are especially vulnerable. 
Hardware implementations are also potentially vulnerable, but may require a 
higher sampling frequency for obtaining the power consumption curves. A gen- 
eral algorithmic strategy to counteract power analysis attacks is to randomize 
the computations depending on the secret key by masking the original data with 
random masks and by modifying the computations accordingly [9], [10]. An al- 
ternative way of dealing with power analysis attacks is making use of a special 
encoding of data that tends to balance the power consumption, such as the 
dual-rail encoding, and the corresponding self-checking logical circuits, possibly 
asynchronous (see [12]). 

The DPA attack on a microelectronic device implementing a cryptographic 
algorithm can be prevented if every elementary computation involving the secret 
information and performed by a logical gate in the hardware implementation is 
randomized. More precisely, the general condition to be satisfied is that the out- 
put value of each logical gate in the protected hardware design should follow the 
same probability distribution for each fixed value of the secret key and input in- 
formation, where the uncertainty is provided by purely random masks. In other 
words, it should be statistically independent of the secret key and input informa- 
tion. In principle, even only one logical gate violating the condition may render 
the hardware implementation vulnerable to DPA. It may be interesting to note 
that randomizing the software operations does not ensure that the underlying 
hardware operations are all randomized. 

In principle, for a logical gate with m binary inputs only m independent 
masking bits are sufficient, but this number can possibly be reduced. Of course, 
the greater the total number of masking bits used in the whole circuit, the greater 
the resistance to more sophisticated power analysis attacks such as a higher- 
order DPA. So, the whole hardware implementation can be masked by masking 
individual logical gates, and masking a logical gate means finding an equivalent 
logical circuit that can securely compute the masked output from the masked 
inputs, where all the masks are binary and are XORed with the input and output 
bits. More precisely, for a logical gate implementing a Boolean function $f(X)$,
the masked gate should implement the function $f'(X', R, r) = f(X' \oplus R) \oplus r$,
where $R$ is a binary vectorial input mask and $r$ is a binary output mask. The
computation is required to be secure if $X' = X \oplus R$, and the computed output is
then $f(X) \oplus r$, as desired. The masking bits should preferably be produced by
a random number generator each time the cryptographic function is executed. 
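
The defining relation above can be checked exhaustively for a small gate. The following Python sketch takes a two-input AND as the gate f (an illustrative choice; the names `f_and`, `masked_gate`, and `check_masked_spec` are not from the paper) and confirms that feeding the masked inputs X' = X xor R into f' yields f(X) xor r. It checks only the input-output specification: evaluating f' literally as written would unmask X internally, which is exactly what a secure masked circuit has to avoid.

```python
from itertools import product

def f_and(x):
    """Unmasked gate f: two-input AND (illustrative choice, not from the paper)."""
    return x[0] & x[1]

def masked_gate(f, x_masked, R, r):
    """Specification f'(X', R, r) = f(X' xor R) xor r. Not a secure circuit."""
    x = tuple(a ^ b for a, b in zip(x_masked, R))
    return f(x) ^ r

def check_masked_spec(f, m):
    """Check f'(X xor R, R, r) == f(X) xor r for all X, R in {0,1}^m and all r."""
    for X in product((0, 1), repeat=m):
        for R in product((0, 1), repeat=m):
            for r in (0, 1):
                x_masked = tuple(a ^ b for a, b in zip(X, R))
                if masked_gate(f, x_masked, R, r) != f(X) ^ r:
                    return False
    return True

print(check_masked_spec(f_and, 2))
```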

A general masking technique based on using the MUX gate with 2 input 
bits, 1 control bit, and 1 output bit is proposed in [10]. It essentially consists 
in representing a Boolean function by a tree of MUXes and then in masking 
the MUXes. The function values as binary constants are used as inputs to the 
top layer of MUXes in the tree and are all masked by the same masking bit, 
which is also the masking bit for the output. The main observation from [10] is 
essentially that a MUX as an elementary logical gate can be masked by using a 
cascade connection of a SWITCH and a MUX, where the SWITCH is controlled 
by the control masking bit and the MUX is controlled by the masked control bit.
Namely, if the two input masking bits are the same, then this cascade produces
the MUX output bit masked with the same mask as the two inputs. Moreover, 
the two outputs of the SWITCH are computed securely, that is, each of them 
is statistically independent of the original input bits. As a SWITCH can be im- 
plemented by a parallel connection of two MUXes, the masked MUX is thus 
securely implemented by using 3 MUXes and has a logical depth of 2 MUXes. 
Consequently, the resulting logical circuit for the masked Boolean function con- 
tains 3 times as many MUXes and has double depth in comparison with the 
original tree of MUXes. This is the price to pay for the protection against DPA.
It is not specifically explained in [10] how to apply this technique to an arbitrary 
logical circuit. 
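
The SWITCH-then-MUX cascade described above can be sketched in Python as follows. The MUX polarity (the first data input is selected when the control bit is 1) and all function names are assumptions made for illustration. The check runs over all input values and all mask values and confirms two things: the output equals the unmasked MUX output xored with the common input mask, and each SWITCH output takes the values 0 and 1 equally often whatever the unmasked inputs are.

```python
from itertools import product
from collections import Counter

def mux(x, y, c):
    """2-to-1 MUX: returns x when c == 1, else y (polarity is an assumption)."""
    return x if c else y

def switch(a, b, s):
    """SWITCH built from two MUXes: swaps (a, b) when s == 1."""
    return mux(b, a, s), mux(a, b, s)

def masked_mux(xm, ym, cm, rc):
    """Masked MUX: SWITCH driven by the control mask rc, then a MUX driven
    by the masked control bit cm = c xor rc. Uses 3 MUXes with depth 2."""
    a, b = switch(xm, ym, rc)
    return mux(a, b, cm), (a, b)

def verify():
    uniform = Counter({0: 2, 1: 2})
    for x, y, c in product((0, 1), repeat=3):
        dist_a, dist_b = Counter(), Counter()
        for r, rc in product((0, 1), repeat=2):
            out, (a, b) = masked_mux(x ^ r, y ^ r, c ^ rc, rc)
            if out != mux(x, y, c) ^ r:   # output keeps the input mask r
                return False
            dist_a[a] += 1
            dist_b[b] += 1
        # Each SWITCH output must be uniform whatever (x, y, c) is.
        if dist_a != uniform or dist_b != uniform:
            return False
    return True

print(verify())
```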

An important issue which is not addressed in [10] is related to the atomicity of
the MUX implementation in hardware. If the MUX output is produced in a single 
step by a single gate, then the masked MUX computation is secure. However, in 
practice the MUX is usually implemented by using the logical AND, OR, and 
NOT gates. Namely, if c denotes the control input and x and y the two data 
inputs, then the MUX implements the Boolean function $(c \wedge x) \vee (\bar{c} \wedge y)$. Now, if
in the masked MUX each of the 3 MUXes is implemented in this way, it can be 
proven that the output of each elementary gate used is computed securely. 
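
The gate-level claim can be illustrated by brute force. The sketch below assumes that each MUX is built as (c AND x) OR (NOT c AND y), records the output of every NOT, AND, and OR gate in the 3-MUX masked MUX, and counts how often each of the 12 gate outputs equals 1 over the four equally likely mask pairs (r, rc). If these counts are the same for every choice of the unmasked bits (x, y, c), then each individual gate output is statistically independent of them. All names are hypothetical, and the model ignores timing effects such as glitches.

```python
from itertools import product
from collections import Counter

def gate_mux(x, y, c, wires):
    """MUX as (c AND x) OR (NOT c AND y); appends every gate output to wires."""
    nc = 1 - c
    t1 = c & x
    t2 = nc & y
    out = t1 | t2
    wires.extend([nc, t1, t2, out])
    return out

def masked_mux_wires(x, y, c, r, rc):
    """All 12 gate outputs of the masked MUX for one choice of masks."""
    wires = []
    xm, ym, cm = x ^ r, y ^ r, c ^ rc
    a = gate_mux(ym, xm, rc, wires)   # first SWITCH output
    b = gate_mux(xm, ym, rc, wires)   # second SWITCH output
    gate_mux(a, b, cm, wires)         # final MUX on the masked control bit
    return tuple(wires)

def gate_distributions(x, y, c):
    """For each gate, the number of mask pairs (r, rc) giving output 1."""
    ones = Counter()
    for r, rc in product((0, 1), repeat=2):
        for i, w in enumerate(masked_mux_wires(x, y, c, r, rc)):
            ones[i] += w
    return tuple(ones[i] for i in range(12))

dists = {gate_distributions(x, y, c) for x, y, c in product((0, 1), repeat=3)}
print(len(dists) == 1)
```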

As mentioned in Section 2, if a new secret key for data scrambling is produced 
for every new execution of the cryptographic function and since the constraints 
regarding the speed and size are very restrictive, then the DeKaRT networks used 
for data scrambling need not be protected against power analysis by a masking 
technique. On the other hand, if the DeKaRT network is used for a block cipher,
it should preferably be protected by masking. All that is needed is to mask
the individual DeKaRT building blocks used and this can be achieved by adapt- 
ing the MUX-based masking technique [10] described above. It is interesting 
that the masked round key bits needed can themselves be securely computed 
by the masked DeKaRT network used for the key expansion algorithm. Note 
that the key expansion algorithm is not vulnerable to DPA as it does not in- 
volve any input data, but other power analysis techniques may be applicable. 
A DeKaRT building block can be masked by using a representation in terms of 
MUXes (or SWITCHes) and by replacing each MUX by a masked MUX. The 
masked SWITCH is equivalently a cascade connection of two SWITCHes being 
controlled by the control masking bit and the masked control bit, respectively. 
The only condition to be respected is that the two inputs for each MUX (or 
SWITCH) should be masked by the same binary mask and that the control bit 
should be masked by an independent binary mask. 

The MUX block from Fig. 1 can directly be represented in terms of MUXes. 
In fact, it contains k distinct trees of MUXes with a total of $k(2^m - 1)$ elementary
MUXes with 2 input bits, 1 control bit, and 1 output bit. In each of the k trees
there are m levels of MUXes controlled by m control data bits and $2^m$ key bits
are used as inputs to the top level. Now, assume that the m control data bits are
masked by independent masking bits and that the $2^m$ key bits for each of the
k trees are masked by the same mask. Accordingly, if each MUX is replaced by
a masked MUX, then each of the k produced output bits is masked by the same
bit that is used for masking the $2^m$ key bits at the input to the top level of the
corresponding tree and all the computations are secure. The k masking bits can 
be independent, but other options are also possible. For example, they can all 
be the same, whereas all the masking bits for m control data bits can also be 
the same. Note that these two masking bits have to be independent in order for 
the computation to be secure. 
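
A single tree of the MUX block can be sketched as follows, as an illustration rather than the paper's implementation. The sketch selects one of 2^m key bits with m control bits, counts the 2^m - 1 MUX evaluations, and then repeats the selection with masked MUXes, using one mask `rk` for all key bits and one independent mask `rc` for all control bits (the last option mentioned above). The MUX polarity, the bit ordering of the control bits, and all names are assumptions.

```python
from itertools import product

def mux(x, y, c):
    """2-to-1 MUX: x when c == 1, else y (polarity is an assumption)."""
    return x if c else y

def masked_mux(xm, ym, cm, rc):
    """SWITCH on the control mask rc, then MUX on the masked control bit cm."""
    a, b = mux(ym, xm, rc), mux(xm, ym, rc)
    return mux(a, b, cm)

def mux_tree(leaves, controls, node):
    """Reduce 2^m leaves to one bit; level j uses the argument tuple controls[j].
    Returns (output bit, number of node evaluations)."""
    level, count = list(leaves), 0
    for ctl in controls:
        level = [node(level[i + 1], level[i], *ctl) for i in range(0, len(level), 2)]
        count += len(level)
    return level[0], count

def check_tree(m=2):
    n = 1 << m
    for key in product((0, 1), repeat=n):
        for ctl in product((0, 1), repeat=m):
            idx = sum(bit << j for j, bit in enumerate(ctl))
            plain, count = mux_tree(key, [(c,) for c in ctl], mux)
            if plain != key[idx] or count != n - 1:
                return False
            for rk, rc in product((0, 1), repeat=2):
                leaves = [kb ^ rk for kb in key]
                masked_ctl = [(c ^ rc, rc) for c in ctl]
                out, _ = mux_tree(leaves, masked_ctl, masked_mux)
                if out != key[idx] ^ rk:   # output carries the key mask rk
                    return False
    return True

print(check_tree(2), check_tree(3))
```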

In order to mask the block implementing the key-dependent reversible transformations
$R_k$, the corresponding logical circuit should be represented in terms
of MUXes. This is already the case with any logical circuit composed of XORs 
and SWITCHes, as suggested in Section 4, and in particular with the circuit 
shown in Fig. 2. Then masking can be achieved by replacing each SWITCH by a 
masked SWITCH and by keeping the XORs as they are, having in mind that the 
output mask for an XOR is the XOR of the two input masks. The constraints 
to be satisfied are that the two inputs for each SWITCH should be masked by 
the same binary mask and that the control bit should be masked by an inde- 
pendent mask as well as that the two inputs for each XOR should be masked by 
independent masks. If needed, the mask at any point can be changed by using 
an extra XOR with another independent mask. 
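
The two rules used above, that a masked SWITCH is a cascade of two SWITCHes and that an XOR is kept as it is with its output mask being the XOR of its input masks, can be checked with a short Python sketch. The function names are hypothetical; the check only covers functional correctness of the masked values, not the statistical argument.

```python
from itertools import product

def switch(a, b, s):
    """Swap the pair when s == 1, pass it through when s == 0."""
    return (b, a) if s else (a, b)

def masked_switch(am, bm, sm, rs):
    """Cascade of two SWITCHes: first on the control mask rs, then on the
    masked control bit sm = s xor rs. The two swaps compose to a swap by s."""
    return switch(*switch(am, bm, rs), sm)

def masked_xor(am, ra, bm, rb):
    """XOR is applied to the masked bits unchanged; the output mask is
    ra xor rb. Returns (masked output, output mask)."""
    return am ^ bm, ra ^ rb

def check():
    for a, b, s, r, rs in product((0, 1), repeat=5):
        expected = tuple(v ^ r for v in switch(a, b, s))
        if masked_switch(a ^ r, b ^ r, s ^ rs, rs) != expected:
            return False
    for a, b, ra, rb in product((0, 1), repeat=4):
        out, mask = masked_xor(a ^ ra, ra, b ^ rb, rb)
        if out ^ mask != a ^ b:
            return False
    return True

print(check())
```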

In the particular example from Fig. 2, which can be taken as an elementary 
building block to produce iterated block ciphers, we have $(m, n, k) = (2, 2, 3)$.
The MUX block containing 3 trees each composed of 3 MUXes can be masked 
as explained above. The logical circuit composed of 2 XORs and 1 SWITCH can 
be masked by keeping the XORs and by replacing the SWITCH by a masked 
SWITCH, i.e., by a cascade of two SWITCHes, while the masking bits should 
be assigned following the general guidelines given above. Namely, the main con- 
dition to satisfy is that the two input masks for the cascade of SWITCHes 
should be the same. This can be achieved without introducing two extra XORs 
to change the masks for the two data inputs $x_1$ and $x_2$ by using the following
two assignments of masking bits. 

Let $k_1$, $k_2$, and $k_3$ denote the key bits used for the 2 XORs and the SWITCH
in Fig. 2, respectively. In the first assignment, the input data bits $x_i$ have independent
masks $r_i$, $1 \le i \le 4$. If $k_1$ and $k_2$ are masked by $r_2$ and $r_1$, respectively,
then the two inputs to the masked SWITCH are masked by the same mask,
$r_1 \oplus r_2$, as desired. The third key bit $k_3$ can then be masked either by $r_1$ or $r_2$,
and the output bits of the masked SWITCH are then both masked by $r_1 \oplus r_2$.
Two additional XORs at the output can be used to change this mask into $r_1$
and $r_2$ for $y_1$ and $y_2$, respectively. As a result, the four output masks are the
same as the four input masks, but the masking bits for the round keys should
be adapted to the masking bits for the data.

In the second assignment, the input data bits $x_i$ are all masked by the same
mask $r$, whereas all the key bits are themselves also masked by the same mask
$r_0$, which is independent of $r$. The two inputs to the masked SWITCH are then
masked by the same mask, $r \oplus r_0$, and so are the two output bits $y_1$ and $y_2$. To
maintain the same type of mask assignment, the mask for the inputs $x_3$ and $x_4$
can be changed into $r \oplus r_0$ by using two additional XORs. This does not increase
the depth of the masked DeKaRT block, which remains 7 MUXes. For the
second assignment, it is easier to produce the masked round keys as their masks 
are independent of the masks used for the data. In particular, one can use only 
one masking bit for the whole data block and only one, independent masking 
bit for each round key. Then after each round, every intermediate ciphertext bit 
will be masked by the same masking bit and this bit changes from round to 
round depending on the round key masking bits. The masked round keys can be 
obtained analogously, by using the corresponding masked DeKaRT network. 
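
The two mask assignments can be checked symbolically. In the Python sketch below a mask is a set of mask-bit names, and the XOR of two masks is the symmetric difference of the sets. The data path is assumed from the description of Fig. 2: x1 xor k1 and x2 xor k2 feed a SWITCH controlled by k3. The rule check relies on the fact that two distinct non-empty XOR combinations of independent uniform bits are independent of each other. All names are hypothetical.

```python
def xor_masks(a, b):
    """Mask of an XOR output: mask bits present in both inputs cancel."""
    return frozenset(a) ^ frozenset(b)

def check_block(x_masks, k_masks):
    """x_masks: masks of x1, x2; k_masks: masks of k1, k2, k3.
    Returns the common mask of the two SWITCH inputs, or None when the
    inputs differ in mask, are unmasked, or share the control mask."""
    in1 = xor_masks(x_masks[0], k_masks[0])
    in2 = xor_masks(x_masks[1], k_masks[1])
    ctl = frozenset(k_masks[2])
    if in1 != in2 or not in1 or not ctl or ctl == in1:
        return None
    return in1

# First assignment: x1, x2 masked by r1, r2; k1, k2, k3 masked by r2, r1, r1.
first = check_block([{"r1"}, {"r2"}], [{"r2"}, {"r1"}, {"r1"}])
# Second assignment: all data bits masked by r, all key bits by r0.
second = check_block([{"r"}, {"r"}], [{"r0"}, {"r0"}, {"r0"}])
# A broken variant: k1 masked by r1 cancels the mask of x1.
broken = check_block([{"r1"}, {"r2"}], [{"r1"}, {"r2"}, {"r1"}])
print(sorted(first), sorted(second), broken)
```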

7 Conclusions 

The proposed DeKaRT method for constructing key-dependent reversible logical 
circuits is not only suitable for data scrambling functions, but can also be used for 
constructing a new general type of hardware-oriented block ciphers as well as the 
required key expansion algorithms. In addition, the resulting hardware designs 
can efficiently be protected against power analysis by a masking technique on 
the logical gate level.

Abstract. Deliberate injection of faults into cryptographic devices is an effective
cryptanalysis technique against symmetric and asymmetric encryption algorithms.
In this paper we describe a parity code based concurrent error detection
(CED) approach against such attacks in substitution-permutation network
(SPN) symmetric block ciphers [22]. The basic idea is to compare a carefully modified
parity of the input plain text with that of the output cipher text, resulting in a
simple CED circuitry. An analysis of the SPN symmetric block ciphers reveals 
that on one hand, permutation of the round outputs does not alter the parity from 
its input to its output. On the other hand, exclusive-or with the round key and 
the non-linear substitution function (s-box) modify the parity from their inputs 
to their outputs. In order to change the parity of the inputs into the parity of out- 
puts of an SPN encryption, we exclusive-or the parity of the SPN round function 
output with the parity of the round key. We also add to all s-boxes an additional 
1-bit binary function that implements the combined parity of the inputs and out- 
puts to the s-box for all its (input, output) pairs. These two modifications are 
used only by the CED circuitry and do not impact the SPN encryption or de- 
cryption. The proposed CED approach is demonstrated on a 16-input, 16-output 
SPN symmetric block cipher from [1]. 



1 Introduction 

Until recently cryptanalysts analyzed cipher systems by using rigorous mathematics 
based techniques such as differential cryptanalysis [2] and linear cryptanalysis [3]. 
Although these techniques are useful in exploring weaknesses in algorithms, they do 
not exploit weaknesses in their implementations. Hardware and Software implementa- 
tions of (crypto) algorithms leak information via side-channels such as time consumed 
by the operations, power dissipated by the operators, electromagnetic radiation emitted
by the device and faulty computations resulting from deliberate injection of faults into
the system. Traditional cryptanalysis techniques can be combined with such side-channel
attacks to uncover the secret key and/or the implementation details
of the cipher. Even a small amount of side-channel information is sufficient to
break common ciphers [4]. For example, Differential Fault Analysis (DFA), which uses
deliberate injection of faults, requires between 50 and 200 cipher text blocks to recover a
key of the symmetric block cipher Data Encryption Standard (DES); the best traditional
attack requires approximately 64 terabytes of plain text and cipher text encrypted
under a single key.



1.1 Fault Based Attacks: Motivation 

Fault based cryptanalysis (for example, DFA) is based on the observation that faults 
deliberately injected into a crypto-device leak information about the implemented 
algorithms. These attacks are practical since elevated levels of radiation or heat, incor- 
rect voltage, or atypical clock rate can cause a tamperproof device to malfunction. 
Boneh, DeMillo and Lipton [5] presented the first fault based side-channel attack 
against asymmetric public-key cryptography devices. More recently, Biham and 
Shamir [6] presented a fault-based cryptanalysis of symmetric block cipher Data En- 
cryption Standard (DES). They presented a transient fault based Differential Fault 
Analysis (DFA) attack and a permanent fault based non-DFA attack to recover the 
round keys using a very small number of cipher texts. They then extended their fault 
model to show that DFA can uncover the structure of an unknown cryptosystem implemented
in an EEPROM based smart card, based on the observation that it is much
easier to inject a 1→0 bit flip than to inject a 0→1 bit flip in an EEPROM. Using DES
as the unknown cipher, they showed that (i) about 500 faulty cipher texts are sufficient 
to identify the bits of the right half, (ii) about 5000 faulty cipher texts are sufficient to 
identify the non-linear substitution operations (s-boxes) and their input and output bits, 
and (iii) about 10000 faulty cipher texts are sufficient to reconstruct the DES s-boxes. 

Anderson and Kuhn described additional fault based side-channel attacks on soft- 
ware implementations of encryption algorithms [7]. In one of the attacks they assumed 
that the instruction memory of smart cards can be corrupted. If in a process loop, the 
variable controlling the number of rounds is set to 1, encryption executes just one 
round, thereby compromising the round key. Another attack focused on the chip writ- 
ing ability of the attacker. Assuming that the attacker is familiar with the implementa- 
tion, he can extract keys from the card by overwriting specific memory locations. 



1.2 Fault-Based Side Channels: The Fault Models 

Boneh, DeMillo and Lipton [5] use a practical fault model wherein a fault is induced at
a random bit location in one of the registers at some random intermediate round of a 
cryptographic computation. Biham and Shamir [6] use a similar realistic fault model 
wherein either a transient or a permanent fault is induced randomly into the device. 
They then adapt this basic fault model to the asymmetric property of EEPROMs: it is
much easier to induce a 1→0 bit flip than to induce a 0→1 bit flip. Anderson and
Kuhn used two different fault models for microcontroller based smartcards: in the first 
they assume that the instruction memory of smart cards can be randomly corrupted 
and in the second they assume that the attacker has the ability to write into specified 
locations in the memory. These and other fault attacks and associated fault models are 
summarized in [8,9]. The proposed CED approach is applicable to the practical fault 
models described. 



1.3 CED Architectures for Symmetric Block Ciphers: Background 

Concurrent error detection (CED) followed by suppression of the corresponding faulty 
output can thwart fault injection attacks; on detecting a faulty computation, the stored 
key is protected by suppressing the corresponding faulty cipher text. 

Straightforward duplication and comparison of encryption and decryption hardware 
yields more than 100% hardware overhead. Alternatively, a spare module for each 
type can be used to detect faults in hardware modules of that type. Such a spares based 
approach has been adopted in a hardware implementation of the 128-bit symmetric 
block cipher IDEA [10]. Spares based approaches are suitable for block ciphers that 
use arithmetic operators, such as IDEA and RC6 [11]. Although hardware is not dupli- 
cated, an extra module for each operation type entails considerable hardware over- 
head, especially for encryption algorithms like Advanced Encryption Standard (AES) 
[12] and DES that use random, non-arithmetic operations such as S-Boxes. 

A time redundancy based CED approach involves encrypting (decrypting) the data a
second time followed by the comparison of the two results. Wolter et al. [13] developed a
CED technique for symmetric block cipher IDEA wherein the test data was encrypted
and then decrypted. This approach entails more than 100% time overhead. Further, it
can only tolerate transient faults if the data traverses identical paths through the en- 
cryption and decryption data paths both during the normal computation and during the 
re-computation. 

Karri et al. [14] developed a systematic CED approach for symmetric block ciphers
at the register transfer level that exploits the inverse relationship between the encryp- 
tion and decryption at the algorithm level, round level and individual operation level. 
They demonstrated this inverse-relationship principle on 128-bit symmetric block 
ciphers including Advanced Encryption Standard, RC6 and Serpent. The main drawback
of this approach is that it assumes that the cipher device operates in a half-duplex
mode (i.e., either the encryption or the decryption is active, but not both
simultaneously). Bertoni et al. [15] applied this inverse-relationship principle to round key
generation of the AES encryption algorithm using additional hardware for inverse 
round key generation and comparison. 

Another CED approach involves encoding the message before encryption and 
checking it for errors after decryption. Wolter et al. [13] used residue codes for fault
detection in adders, multipliers, and EXCLUSIVE-ORs. Area overhead of this ap- 
proach is due to the encoders at the input and decoders at the output to translate the 
plain and cipher texts into the internal code words. In [16] the plaintext is encoded by
setting several bits of the message to a particular fixed value, 0 or 1, and then encrypted.
A mismatch between these fixed bits of the deciphered text and the original plain
text detects an error. The simple code (all zeroes or all ones) results in significantly
less area overhead when compared to other encoding schemes. This scheme has a
large fault detection latency, since faults in the encryption hardware at the
transmitter end are detected by the decryption hardware at the receiver. Further, there is an
associated performance penalty since it uses some of the bits in messages for error detection.

In [17] a CED technique that predicted the inverse of the parity of the outputs was 
proposed for the non-linear s-box and other functions used in DES. A similar tech- 
nique that predicted the parity of the outputs for the non-linear s-box and linear mixing 
functions used in the AES was proposed in [18,19]. In these papers one additional 
parity bit per byte at the outputs of the s-boxes is added. To detect errors at the inputs 
to the s-boxes, the inputs of the s-boxes are also parity encoded. The size of the s- 
boxes is doubled by proposing a 512x9-bit implementation resulting in an area over- 
head of over 100% for s-box CED. 



[Figure: plaintext bits p1 to p16, four rounds of s-boxes s11 to s44, ciphertext]

Fig. 1. Substitution Permutation Network (SPN) cryptosystem

2 Substitution-Permutation Network (SPN) Block Ciphers

The architecture of a symmetric block cipher contains a key expansion module, an 
encryption module and a decryption module. The key expansion module expands the user
key to generate round keys and loads them into the key RAM prior to encryption or 
decryption. Using the round keys, the device encrypts (decrypts) the plain (cipher) text 
to generate the cipher (plain) text. Symmetric block ciphers have an iterative looping 
structure. All the rounds of encryption and decryption are identical in general, with 
each round using several operations and round key(s) to process the input data. Con- 
sider the well-known substitution-permutation network (SPN) cryptosystem shown in 
Figure 1. Such an SPN architecture consisting of a non-linear substitution layer (s- 
boxes) connected by output bit position permutations is an easy to understand yet 
realistic architecture [1]. The example SPN cryptosystem shown in Figure 1 operates
on a 16-bit plaintext and generates a 16-bit cipher text in four rounds. Each SPN encryption
round is composed of a non-linear substitution operation (using four 4x4
s-boxes), a permutation and exclusive-or with a 16-bit round key. The sixteen 4x4
s-boxes in this example cryptosystem are different. To preserve symmetry between
encryption and decryption, the first round operation is preceded by exclusive-or with
the 16-bit key, key 0. Then the four 16-bit round keys (key 1, key 2, key 3, and key 4)
are exclusive-ored following the permutation operation in each round (in Figure 1 the
dots on the s-box input lines represent exclusive-or).

2.1 Parity-Based Concurrent Error Detection 

Protection of crypto-devices entails protecting the encryption/decryption data paths as 
well as the key RAM used to hold the round keys. Significant work has been done to
protect the RAM using Parity code, Hamming code etc. In this paper we are interested 
in CED of the encryption data path and we do not address CED for key RAM. 

The proposed CED design approach uses a parity code. The specific CED implementation
depends on the SPN implementation architecture. Consider the unfolded
implementation architecture shown in Figure 1 (this is necessary because all 16
s-boxes are different). The parity of the inputs to the first round, P(x), is determined by a
parity tree of the 16 inputs. The CED structure modifies this input parity according to 
the successive processing steps of the SPN round function such that the modified par- 
ity is equal to the parity of the outputs of the SPN circuitry of the first round. The 
CED architecture shown in Figure 2 repeatedly modifies the parity in the manner
discussed in each of the four rounds and compares it with the parity of the cipher text. 

The operations in an SPN round are: non-linear transformations by the sixteen 4x4
substitution boxes (s-box), bit-permutation and exclusive-or with the round key. Non- 
linear substitution boxes used in SPN-based and other encryption algorithms have 
been designed to satisfy properties such as maximum non-linear order, high nonline- 
arity, low differential uniformity and low bias [20,21]. Satisfaction of these properties 
has been shown to reflect the strength of the s-box against linear and differential 
cryptanalysis. These s-boxes do not maintain the parity from their inputs to their outputs.

Table 1. 4x4 substitution box supplemented with $m(i) = \mathrm{parity}(i) \oplus \mathrm{parity}(s(i))$

We add to every four-input, four-output s-box an additional binary output for the
purpose of modifying the input parity of the SPN circuitry into the output parity in the
considered SPN round. Let s be an s-box with four inputs $i_1, i_2, i_3, i_4$ and four outputs
$s_1(i_1,i_2,i_3,i_4)$, $s_2(i_1,i_2,i_3,i_4)$, $s_3(i_1,i_2,i_3,i_4)$, and $s_4(i_1,i_2,i_3,i_4)$. Then for every input
$i = (i_1, i_2, i_3, i_4)$ of the s-box, the additional modifying output m(i) implements
$m(i) = i_1 \oplus i_2 \oplus i_3 \oplus i_4 \oplus s_1 \oplus s_2 \oplus s_3 \oplus s_4 = \mathrm{parity}(i) \oplus \mathrm{parity}(s(i))$.
Here parity(i) and parity(s(i)) are the input
parity and the output parity, respectively, of the considered s-box. The design of this
modifying output m(i) for an example 4x4 s-box from [1] is shown now. The first two
columns of Table 1 show the truth table of an s-box. In the first column the four-bit
inputs i and in the second column the four-bit outputs s(i) of the s-box are given in
hexadecimal representation. In the third column the binary value m(i), which implements
$m(i) = \mathrm{parity}(i) \oplus \mathrm{parity}(s(i))$, is given. Thus, for example, in row 6 of Table 1, for
the input $i = (i_1, i_2, i_3, i_4) = 0101 = 5$ the functional output of the s-box is
$s(5) = (s_1(5), s_2(5), s_3(5), s_4(5)) = 1111 = \mathrm{F}$, and for the additional binary output m(5) of the
s-box we have $m(5) = 0 \oplus 1 \oplus 0 \oplus 1 \oplus 1 \oplus 1 \oplus 1 \oplus 1 = 0$.
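
The modifying output can be tabulated with a few lines of Python. The s-box values below are an assumption: they are taken from the widely used tutorial s-box that [1] is believed to refer to, and they agree with the worked example s(5) = F, but the entries of Table 1 are not reproduced in this text. The function names are hypothetical.

```python
# Assumed s-box (consistent with s(5) = F in the text; Table 1 itself is
# not available here, so these values are not confirmed by the source).
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]

def parity(v):
    """XOR of all bits of the non-negative integer v."""
    return bin(v).count("1") & 1

def modifying_output(sbox):
    """m(i) = parity(i) xor parity(s(i)) for every input i."""
    return [parity(i) ^ parity(s) for i, s in enumerate(sbox)]

M = modifying_output(SBOX)
for i, (s, m) in enumerate(zip(SBOX, M)):
    print(f"{i:X}  {s:X}  {m}")
```

The three printed columns correspond to i, s(i), and m(i); by construction, parity(i) xor m(i) equals parity(s(i)) in every row.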

In the complete CED architecture shown in Figure 2, a thick box appended to the 
right hand side of an s-box shows this modifying output. This modifying output in 
each s-box is used only for CED and does not impact either the encryption or the de- 
cryption. Since we do not change the functionality of the s-boxes, the strength of the 
used cryptographic algorithm, based to a large extent on the concrete form of the s- 
boxes, is preserved. 

Next, since permutations do not change the parity, no modification circuitry is necessary
for them. Finally, bit-wise modulo-2 addition of the 16-bit key 0 modifies the parity of
the input plain text by the parity of key 0 prior to the first SPN round. Bit-wise modulo-2
addition of the 16-bit key 1 modifies the parity by the parity of key 1 in the first SPN
round. Similarly, bit-wise modulo-2 addition of the 16-bit key 2 modifies the parity by
the parity of key 2 in the second SPN round, and so on. The overall modification due to
all the round keys can be pre-computed during round key generation as parity of key 0 $\oplus$
parity of key 1 $\oplus$ parity of key 2 $\oplus$ parity of key 3 $\oplus$ parity of key 4. This absorbs
the associated time overhead into that of round key generation.
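
Putting the three modifications together, the predicted parity of the cipher text is the plain text parity, xored with the pre-computed key parity and with the modifying outputs of all s-boxes. The Python sketch below illustrates this on a toy 16-bit, four-round SPN. Several details are assumptions made for brevity and are not taken from the paper: a single s-box is reused in all sixteen positions (the paper uses sixteen different ones), the bit permutation is a 4x4 transposition, and all names are hypothetical.

```python
import random

# Assumed s-box, reused in every position for brevity.
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]

def parity(v):
    """XOR of all bits of the non-negative integer v."""
    return bin(v).count("1") & 1

def permute(state):
    """Assumed wiring: bit 4*a + b moves to position 4*b + a."""
    out = 0
    for a in range(4):
        for b in range(4):
            out |= ((state >> (4 * a + b)) & 1) << (4 * b + a)
    return out

def substitute(state):
    """Apply the s-box to each nibble; also return the XOR of the four m(i)."""
    out, m_sum = 0, 0
    for n in range(4):
        i = (state >> (4 * n)) & 0xF
        s = SBOX[i]
        out |= s << (4 * n)
        m_sum ^= parity(i) ^ parity(s)
    return out, m_sum

def encrypt_with_ced(plaintext, keys):
    """keys = [key0, ..., key4]. Returns (ciphertext, predicted parity)."""
    key_parity = 0
    for k in keys:                    # can be pre-computed with the round keys
        key_parity ^= parity(k)
    predicted = parity(plaintext) ^ key_parity
    state = plaintext ^ keys[0]
    for rk in keys[1:]:
        state, m_sum = substitute(state)
        predicted ^= m_sum            # contribution of the modifying outputs
        state = permute(state) ^ rk   # the permutation leaves parity unchanged
    return state, predicted

random.seed(1)
keys = [random.getrandbits(16) for _ in range(5)]
ct, predicted = encrypt_with_ced(random.getrandbits(16), keys)
print(parity(ct) == predicted)
```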

While this architecture might apply to most of the common examples, it does not
necessarily apply to all cases. For example, not all architectures require an explicit
decryption module; some block ciphers, DES being the most noteworthy example,
look practically identical regardless of the direction as long as the round keys are
reversed. Also, while many block ciphers have some internal iterative round component,
in practice they often contain other structures as well (such as pre- and post-whitening
steps). However, the general principle can be easily adapted to these situations. The
proposed CED method is also applicable to other symmetric-key primitives such as
message authentication codes and stream ciphers that have an SPN structure.

3 Fault Detection Capability



In section 2 we explained how the input parity of an SPN round is modified step-by- 
step according to the processing steps of that round in such a way that the modified 
parity of the inputs is equal to the parity of the outputs of that round if no error occurs. 

In this section we show how an error due to a single stuck-at fault in a processing
step of the SPN Symmetric Block Cipher is detected by the proposed CED method.
This is illustrated in Figure 3 for four successive processing steps. The inputs x are
processed into the outputs y in step 1 and the parity P(x) of the inputs is modified into
$P(x) \oplus P(x) \oplus P(y) = P(y)$. We assume now that a fault f occurs in the hardware
implementing the processing step 2 with the result that the outputs of the second step are
now $z_f$ instead of the correct outputs z, with $z_f \ne z$. The parity P(y) is corrected in this
second step into the correct value $P(y) \oplus P(y) \oplus P(z) = P(z)$. If the error due to the fault
f is detectable by parity we have $P(z_f) \ne P(z)$. In step 3 the erroneous inputs $z_f$ (instead
of the correct inputs z) are correctly processed by the fault-free hardware of this third
step into the outputs $u_f$ and now the parity P(z) is modified into $P(z) \oplus P(z_f) \oplus P(u_f)$.



[Figure: parity chain from P(x) to the comparison of $P(z_f) \oplus P(z) \oplus P(v_f)$ with $P(v_f)$]

Fig. 3. Analysis of fault detection capability

Similarly, in step 4 the inputs $u_f$ are correctly processed into $v_f$ and the parity
$P(z) \oplus P(z_f) \oplus P(u_f)$ is now modified into
$P(z) \oplus P(z_f) \oplus P(u_f) \oplus P(u_f) \oplus P(v_f) = P(z) \oplus P(z_f) \oplus P(v_f)$.
Finally the modified parity $P(z) \oplus P(z_f) \oplus P(v_f)$ is compared with $P(v_f)$, the
parity of the outputs $v_f$ of step 4, and for $P(z) \ne P(z_f)$ the error due to the fault f will be
detected. Thus, if, due to a fault f in step 2, a single bit (or an odd number of bits) is
erroneous, this error will always be detected by comparing the parity of the outputs of
step 4 with the corresponding modified input parity.

Cryptographic algorithms are designed to satisfy the strict avalanche criterion
[20,21]; even a single bit error at the inputs of an encryption step results in many
different erroneous bits at the outputs of the following encryption steps. But, as explained,
this property of encryption algorithms has no influence on the error detection
capability of the proposed CED method.
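
The four-step argument can be replayed in a short Python sketch. The four byte-wide step functions are arbitrary stand-ins chosen for illustration, not operations from the paper. The parity predictor of each step is modelled as fault-free: it sees the actual input u of the step and contributes P(u) xor P(step(u)), while the fault flips the bits of `error_mask` in the data output of one step only. All names are hypothetical.

```python
def parity(v):
    """XOR of all bits of the non-negative integer v."""
    return bin(v).count("1") & 1

# Arbitrary 8-bit processing steps (illustrative stand-ins only).
STEPS = [
    lambda u: u ^ 0x5A,
    lambda u: (u * 7 + 3) & 0xFF,
    lambda u: ((u << 3) | (u >> 5)) & 0xFF,
    lambda u: (u * u + u) & 0xFF,
]

def run_with_fault(x, steps, faulty_step=None, error_mask=0):
    """Run the steps in order while updating the tracked parity.
    Returns True when the final parity comparison flags an error."""
    tracked = parity(x)
    data = x
    for idx, step in enumerate(steps):
        good = step(data)
        tracked ^= parity(data) ^ parity(good)   # fault-free prediction
        data = good ^ error_mask if idx == faulty_step else good
    return tracked != parity(data)

print(run_with_fault(0x3C, STEPS))                                 # no fault
print(run_with_fault(0x3C, STEPS, faulty_step=1, error_mask=0x01))  # 1-bit error
print(run_with_fault(0x3C, STEPS, faulty_step=1, error_mask=0x03))  # 2-bit error
```

In this model the error is flagged exactly when the error mask has odd weight, however the later steps spread the error, which is the point made about the avalanche property.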

The processing steps of the considered SPN Symmetric Block Cipher are component-wise
exclusive-or with the round key, permutation and non-linear S-box transformation.
For the operation exclusive-or with a round key every single (internal or
external) stuck-at fault of an exclusive-or gate will result in a single bit error which is 
detected by parity. For the permutation operations a single stuck-at fault will result in 
a single bit error which is also detected by parity. For the 16 parity-appended S-boxes 
with four inputs and five outputs (four functional outputs and one parity modifying 
output) CED capability is described now. We designed these S-boxes using the SIS logic
synthesis tool from UC Berkeley. Then for all possible single stuck-at 0/1 faults the 
synthesized S-boxes were simulated for all possible input combinations. If all the 
outputs of the S-boxes are independently implemented (i.e. without sharing gates) 
every single stuck-at-fault results in a single bit error of the outputs of the corre- 
sponding S-box and will be obviously detected by parity. If all the five outputs of the 
S-boxes are jointly optimized then (in rare cases) an even number of S-box output bits
may be in error due to a single stuck-at fault. Then, as the experiments show, 96.3 % 
of the errors due to a single stuck-at fault are detected by parity. If the four functional 
outputs of the S-boxes are jointly optimized and if the parity-modifying bit is sepa- 
rately implemented 98.5% of the errors due to a single stuck-at fault are detected by 
parity. The area of an S-box with four inputs and four outputs without error detection 
in a two-level implementation is 56 units. With an additional fifth parity modifying 
output for CED the area, also in a two-level implementation, is 66 units. Thus, for a 
two-level implementation the area of the S-boxes increases by 18%. For a multi-level 
optimization the area of an S-box without CED is 41 units and with the additional 
parity modifying output it is 51 units. Thus, when the parity modifying output is
separately implemented, the area of an S-box in a multi-level implementation
increases by 24.4%. The overall hardware overhead is therefore determined by an additional
parity tree for computing the input parity, an 18% to 24.4% overhead for the
implementation of the parity modification of the S-Boxes and some exclusive-or gates. 
As we have shown for a separate two-level implementation of the S-boxes with parity 
modification 100% error detection for all the errors due to single stuck-at faults is 
guaranteed by the proposed method. 
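
The parity check on a single S-box and the quoted area figures can be illustrated with a 
short Python sketch. Two assumptions are made here: the 4-bit S-box table is only an 
illustrative permutation (not necessarily one of the 16 S-boxes used in the experiments), 
and the parity-modifying output is modelled as the exclusive-or of the input parity and 
the output parity. Faults are modelled as bit flips on the functional outputs only. 

```python
# Illustrative 4-bit S-box: an assumed example permutation, not taken from the paper.
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]


def parity(value: int) -> int:
    """Return the XOR of all bits of value (0 or 1)."""
    return bin(value).count("1") & 1


def parity_modifier(x: int) -> int:
    """Assumed fifth S-box output: converts the input parity into the output parity."""
    return parity(x) ^ parity(SBOX[x])


def error_detected(x: int, error_mask: int) -> bool:
    """Flip the functional output bits selected by error_mask and run the parity check."""
    faulty_output = SBOX[x] ^ error_mask
    predicted_parity = parity(x) ^ parity_modifier(x)
    return predicted_parity != parity(faulty_output)


def overhead_percent(area_without_ced: float, area_with_ced: float) -> float:
    """Relative area increase in percent."""
    return 100.0 * (area_with_ced - area_without_ced) / area_without_ced


# Every single-bit output error changes the parity and is caught;
# a two-bit error leaves the parity unchanged and escapes detection.
all_single_detected = all(error_detected(x, 1 << b)
                          for x in range(16) for b in range(4))
any_double_detected = any(error_detected(x, 0b0011) for x in range(16))

print(all_single_detected, any_double_detected)
print(round(overhead_percent(56, 66), 1), round(overhead_percent(41, 51), 1))
```

The two overhead values correspond to the roughly 18% (two-level) and 24.4% (multi-level) 
figures given above. 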



3.1 Performance Penalty and Detection Latency 

The parities of the input plaintext and the output cipher text are computed for each 
plaintext, and hence the associated delay should be carefully accounted for. Computing 
and checking the parity can be combined with the round operations in several ways, as 
shown in Table 2. 

Table 2. Optimizing the latency of CED 

Clock   Approach A      Approach B      Approach C      Approach D 
cycle   Round  Parity   Round  Parity   Round  Parity   Round  Parity 
  1     -      P(PT)    1      P(PT)    1      P(PT)    1      P(PT) 
  2     1      -        2      -        2      P(1)     2      P(1) 
  3     2      -        3      -        3      P(2)     3      P(2) 
  4     3      -        4      -        4      P(3)     4      P(3) 
  5     4      -        -      P(CT)    -      P(CT)    -      - 
  6     -      P(CT)    -      -        -      -        -      - 

In this table, we assume that encryption is implemented in four clock cycles with 
one round of encryption per clock cycle. In the straightforward approach A, the parity 
computation of the plaintext is followed by encryption (decryption) that in turn is 
followed by computing and checking the parity of the cipher text. This approach has a 
performance penalty of two clock cycles (one clock cycle for computing the parity of 
the input and one clock cycle for computing the parity of the cipher text + comparison 
with the input parity) and a fault detection latency of one complete encryption (de- 
cryption) i.e. four clock cycles. 

In approach B, the parity of the plaintext is computed concurrently with the first 
round of encryption in clock cycle 1. This is then followed by the rest of the 
encryption (decryption), which in turn is followed by computation and checking of the 
parity of the cipher text. This approach reduces the performance penalty to one clock 
cycle without reducing the detection latency. In approach C, computation of the parity 
of the plaintext is performed concurrently with the first round of encryption in clock 
cycle 1. This is then followed by computation and checking of the parity of the output 
of round one in parallel with the second round of encryption in clock cycle 2, and so 
on. This approach reduces the fault detection latency while maintaining the 
performance penalty of approach B. Other approaches to absorbing the performance 
penalty associated with CED are possible. Each approach has an associated 
performance penalty, fault detection latency (the worst-case duration between 
occurrence and detection of a fault) and fault coverage. 
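
The trade-off between approaches A, B and C can be reproduced from the schedules in 
Table 2 with a short Python sketch. The schedule dictionaries are transcribed from the 
table. The latency model is an assumption made for illustration: a fault is taken to occur 
in the cycle in which a round executes and to be detected at the next parity check 
(P(1), P(2), P(3) or P(CT)). 

```python
ROUNDS = 4  # one encryption round per clock cycle, as assumed for Table 2

# clock cycle -> (round executed or None, parity operation or None)
SCHEDULES = {
    "A": {1: (None, "P(PT)"), 2: (1, None), 3: (2, None), 4: (3, None),
          5: (4, None), 6: (None, "P(CT)")},
    "B": {1: (1, "P(PT)"), 2: (2, None), 3: (3, None), 4: (4, None),
          5: (None, "P(CT)")},
    "C": {1: (1, "P(PT)"), 2: (2, "P(1)"), 3: (3, "P(2)"), 4: (4, "P(3)"),
          5: (None, "P(CT)")},
}


def performance_penalty(schedule):
    """Extra clock cycles compared with an encryption without CED."""
    return max(schedule) - ROUNDS


def detection_latency(schedule):
    """Worst-case number of cycles from a round executing to the next parity check."""
    # P(PT) only computes the reference parity; it does not check anything.
    check_cycles = [cycle for cycle, (_, op) in schedule.items()
                    if op is not None and op != "P(PT)"]
    worst = 0
    for cycle, (round_number, _) in schedule.items():
        if round_number is None:
            continue
        next_check = min(c for c in check_cycles if c > cycle)
        worst = max(worst, next_check - cycle)
    return worst


for name, schedule in SCHEDULES.items():
    print(name, performance_penalty(schedule), detection_latency(schedule))
```

Under this model approach A has a penalty of two cycles and a latency of four cycles, 
approach B keeps the latency but halves the penalty, and approach C reduces the latency 
to a single cycle at the penalty of approach B. 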



4 Conclusions 

In this paper a new method for CED for SPN block ciphers was proposed. More details 
on this general method can be found in [22]. Many of the well-known symmetric block 
ciphers, including AES [12], are SPN ciphers. Following the processing steps of the 
SPN, the parity of the inputs of an encryption round is modified into the parity of the 
outputs and compared with the actual parity of the outputs of this round. To reduce the 
necessary hardware overhead, the parity tree for computing the parity of the inputs can 
also be used to compute the parity of the outputs. If all functional outputs and the 
output for parity modification of the S-boxes are separately optimized, 100% error 
detection for all the errors due to single stuck-at faults was achieved. The additional 
area overhead is low: it consists of an additional parity tree for computing the parity of 
the inputs, an 18% to 24% increase of the area for the S-boxes, and a few exclusive-or 
gates. The proposed concurrent error detection method allows detection of deliberately 
injected faults in addition to technical faults. 



References 



1. H. Heys, “A tutorial on linear and differential cryptanalysis,” 
http://citeseer.ni.nec.com/443539.html. 

2. E. Biham and A. Shamir, “Differential Cryptanalysis of DES-like Cryptosystems”, 
Journal of Cryptology, Vol. 4, No. 1, pp. 3-72, 1991. 

3. M. Matsui, “Linear Cryptanalysis Method for DES Cipher,” Proceedings of Advances in 
Cryptology-Eurocrypt, Springer-Verlag, pp. 386-397, 1994. 

4. J. Kelsey, B. Schneier, D. Wagner, and C. Hall, “Side-Channel Cryptanalysis of Product 
Ciphers,” Proceedings of ESORICS, Springer, pp. 97-110, Sep 1998. 

5. D. Boneh, R. DeMillo, and R. Lipton, “On the importance of checking cryptographic 
protocols for faults”, Proceedings of Eurocrypt, Lecture Notes in Computer Science, 
Springer-Verlag, LNCS 1233, pp. 37-51, 1997. 

6. E. Biham and A. Shamir, “Differential Fault Analysis of Secret Key Cryptosystems”, 
Proceedings of Crypto, Aug 1997. 

7. R. J. Anderson and M. Kuhn, “Low cost attack on tamper resistant devices”, Proceedings 
5th International Workshop on Security Protocols, Lecture Notes in Computer Science, 
Springer-Verlag, LNCS 1361, 1997. 

8. C. Aumuller, P. Bier, P. Hofreiter, W. Fischer and J.-P. Seifert, “Fault attacks on RSA with 
CRT: concrete results and practical countermeasures,” www.iacr.org/eprint/2002/072.pdf. 

9. J. Bloemer and J.-P. Seifert, “Fault based cryptanalysis of the Advanced Encryption 
Standard,” www.iacr.org/eprint/2002/075.pdf. 

10. R. L. Rivest, M. J. B. Robshaw, R. Sidney, and Y. L. Yin, “The RC6 block cipher”, 
ftp://ftp.rsasecurity.com/pub/rsalabs/aes/rc6v11.pdf 

11. H. Bonnenberg, A. Curiger, N. Felber, H. Kaeslin, R. Zimmermann and W. Fichtner, 
“VINCI: Secure test of a VLSI high-speed encryption system”, Proceedings of IEEE 
International Test Conference, pp. 782-790, Oct 1993. 

12. J. Daemen and V. Rijmen, “AES proposal: Rijndael”, 
http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaeldocV2.zip 

13. S. Wolter, H. Matz, A. Schubert and R. Laur, “On the VLSI implementation of the 
International Data Encryption Algorithm IDEA”, IEEE International Symposium on 
Circuits and Systems, Vol. 1, pp. 397-400, 1995. 

14. R. Karri, K. Wu, P. Mishra and Y. Kim, “Concurrent Error Detection of Fault Based 
Side-Channel Cryptanalysis of 128-Bit Symmetric Block Ciphers,” IEEE Transactions on 
CAD, Dec 2002. 

15. G. Bertoni, L. Breveglieri, I. Koren and V. Piuri, “On the propagation of faults and their 
detection in a hardware implementation of the advanced encryption standard,” Proceedings 
of ASAP’02, pp. 303-312, 2002. 

16. S. Fernandez-Gomez, J. J. Rodriguez-Andina and E. Mandado, “Concurrent Error 
Detection in Block Ciphers”, IEEE International Test Conference, Oct 2000. 

17. A. S. Butter, C. Y. Kao and J. P. Kuruts, “DES encryption and decryption unit with error 
checking,” US patent US5432848, Jul 1995. 







18. G. Bertoni, L. Breveglieri, I. Koren, P. Maistri and V. Piuri, “A parity code based fault 
detection for an implementation of the advanced encryption standard,” Proceedings IEEE 
International Symposium on Defect and Fault Tolerance in VLSI, pp. 51-59, Nov. 2002. 

19. G. Bertoni, L. Breveglieri, I. Koren, and V. Piuri, “Error Analysis and Detection 
Procedures for a Hardware Implementation of the Advanced Encryption Standard,” IEEE 
Transactions on Computers, vol. 52, no. 4, pp. 492-505, April 2003. 

20. A. E. Webster and S. E. Tavares, “On the design of S-boxes,” Proceedings of CRYPTO ’85, 
Springer-Verlag Lecture Notes in Computer Science, LNCS 218, pp. 523-534, 1986. 

21. H. Heys and S. E. Tavares, “Avalanche characteristics of substitution permutation 
encryption networks,” IEEE Transactions on Computers, vol. 44, no. 9, pp. 1131-1139, 
Sep 1995. 

22. R. Karri, M. Goessel, and G. Kousnezow, “Method for error detection in kryptographic 
substitution permutation networks,” patent application pending. 




Securing Encryption Algorithms against DPA at the 
Logic Level: Next Generation Smart Card Technology 



Kris Tiri and Ingrid Verbauwhede 



UCLA Electrical Engineering Department, 

7440B Boelter Hall, P.O. Box 951594, Los Angeles, CA 90095-1594 
{tiri, ingrid}@ee.ucla.edu 



Abstract. This paper describes a design method to secure encryption algorithms 
against Differential Power Analysis at the logic level. The method employs 
logic gates whose power consumption is independent of the data signals, 
and therefore removes the foundation for DPA. In a design experiment, a 
fundamental component of the DES algorithm has been implemented. Detailed 
transistor-level simulations show perfect security whenever the layout 
parasitics are not taken into account. 



1 Introduction 

The physical implementation of an encryption algorithm is bound to provide an at- 
tacker with important information on top of the plain- and ciphertext used in tradi- 
tional cryptanalysis. Variations in, among other things, the power consumption of the 
encryption module and the arrival time of the encrypted data can be observed, and 
possibly linked to the input data and the secret key. Attacks that use this additional 
information and link it to the internal state, and hence to the secret key, are referred to 
as Side Channel Attacks (SCA’s) [1]. 

Of these SCA’s, Differential Power Analysis (DPA) is the most powerful. It relies 
on statistical analysis and error correction to extract information from the power 
consumption that is correlated to the secret key [2]. Many countermeasures that conceal 
the supply current variations at the architectural or the algorithmic level have been put 
forward. Yet, they are not really effective or practicable against DPA and/or its 
derivatives, as the variations actually originate at the logic level. 

The basis of DPA is the fact that the power consumption of a single logic gate, which 
is the most elementary building block of the complete encryption module, is controlled 
by both the logic value and the sequence of its input signals. Using a logic style in 
which a logic gate has at all times a constant power consumption that is independent 
of the signal transitions removes the foundation of DPA and is therefore an effective 
means to halt DPA. 

In this paper, we first present the basics of Differential Power Analysis. Then, we 
briefly discuss Sense Amplifier Based Logic, which is a logic style with 
signal-independent power consumption. Next, we (1) introduce the design experiment, 
which consists of securing a module of the DES algorithm against DPA at the logic 
level; (2) investigate the effectiveness of the SABL approach; and (3) discuss the 
effects of the layout parasitics. Finally, a conclusion is formulated. 



2 Basics of Differential Power Analysis 

Differential Power Analysis has been extensively described in literature. It was first 
introduced in [2]. 

A DPA is executed in two phases: data collection and data analysis. During data col- 
lection, the power consumption of the device is measured by sampling and recording 
the supply current for a large number of encryptions. During data analysis, a selection 
function, which depends on a guess of some bits of the secret key, divides the power 
measurements into two sets. For each set, a typical supply current is calculated and 
subsequently a Differential Trace (DT) is generated by computing the difference be- 
tween the two typical supply currents. 

The selection function D consists of predicting a state bit of the encryption module. If 
the correct subset of the secret key has been predicted, D is correlated with the state 
bit and hence with the power consumption of the logic operations that are affected by 
this state bit. The power consumption of the other logic operations and measurement 
errors however, are uncorrelated. As a result, the DT will approach the effect of the 
target bit on the power consumption and there are noticeable peaks in the DT. If on the 
other hand the guess on the secret key was incorrect, the result of the selection func- 
tion is uncorrelated with the state bit: the DT will approach 0. 
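
The two phases described above can be sketched in a few lines of Python. Everything in 
this sketch is an assumption made for illustration: the 4-bit S-box, the leakage model 
(one sample of each trace is raised by the value of the predicted state bit), the Gaussian 
noise level and the number of traces. It is a toy model of the difference-of-means 
computation, not a reproduction of the experiments in this paper. 

```python
import random

# Illustrative 4-bit S-box (assumed example permutation).
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]
SAMPLES_PER_TRACE = 8   # hypothetical number of current samples per encryption
LEAK_SAMPLE = 3         # hypothetical sample at which the state bit leaks


def state_bit(known_input, key):
    """Selection function D: predicted state bit for a given key guess."""
    return SBOX[known_input ^ key] & 1


def simulate_traces(secret_key, n_traces, rng):
    """Data collection: noisy traces whose LEAK_SAMPLE depends on the state bit."""
    inputs, traces = [], []
    for _ in range(n_traces):
        known_input = rng.randrange(16)
        trace = [rng.gauss(0.0, 0.2) for _ in range(SAMPLES_PER_TRACE)]
        trace[LEAK_SAMPLE] += state_bit(known_input, secret_key)
        inputs.append(known_input)
        traces.append(trace)
    return inputs, traces


def differential_trace(inputs, traces, key_guess):
    """Data analysis: difference between the mean traces of the two sets."""
    sums = [[0.0] * SAMPLES_PER_TRACE, [0.0] * SAMPLES_PER_TRACE]
    counts = [0, 0]
    for known_input, trace in zip(inputs, traces):
        d = state_bit(known_input, key_guess)
        counts[d] += 1
        for t, value in enumerate(trace):
            sums[d][t] += value
    return [sums[1][t] / counts[1] - sums[0][t] / counts[0]
            for t in range(SAMPLES_PER_TRACE)]


def best_key_guess(inputs, traces):
    """Return the guess with the highest DT peak, plus all peak values."""
    peaks = {guess: max(abs(v) for v in differential_trace(inputs, traces, guess))
             for guess in range(16)}
    return max(peaks, key=peaks.get), peaks


rng = random.Random(2003)
inputs, traces = simulate_traces(secret_key=0xB, n_traces=2000, rng=rng)
guess, peaks = best_key_guess(inputs, traces)
print(guess, round(peaks[guess], 2))
```

In this toy model the DT of the correct guess peaks near the leakage amplitude, while the 
DT's of the incorrect guesses stay clearly lower. With such a small S-box they do not all 
approach 0, because the predicted bits for different guesses remain partially correlated. 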



3 Sense Amplifier Based Logic: A CMOS Logic Style with Signal 
Independent Power Consumption 

Every logic style can be classified into one of the two existing logic families. If the 
logic gate continuously draws a current from the supply and measures its state through 
the path the current takes, the logic style is said to be Current Mode Logic (CML). If 
on the other hand, the logic gate only draws a current from the supply to change state 
and measures its state by the amount of charge it stores on a capacitance, the logic 
style is said to be Voltage Mode Logic (VML). 

CML has constant power consumption under the condition that the gate draws a per- 
fectly constant current from the power supply and this independently of the in- and/or 
output signals. In order to build a current source capable of generating a constant cur- 
rent, special circuit techniques that minimize channel length modulation have to be 
used. The decisive drawback of CML however, is its static power consumption: even 
when the logic gate is not processing any data, it continuously burns the current, which 
makes this logic style impractical for low power applications. 

A better alternative is Sense Amplifier Based Logic (SABL) [3]. SABL is a VML 
style that uses a fixed amount of charge for every transition, including the degenerated 
events in which the gate does not change state. This means that the logic gate charges 
in every cycle a total capacitance with a constant value, even though ultimately 
different capacitances are switched. 

In short, SABL is based on two principles. First, it is a Dynamic and Differential 
Logic (DDL) and therefore has one and exactly one switching event per cycle, and this 
independently of the input value and sequence. Second, during that switching event, it 
discharges and charges the sum of all the internal node capacitances together with one 
of the balanced output capacitances. Hence, it discharges and charges a constant 
capacitance value. While many DDL-styles exist [4], only SABL (1) controls exactly the 
contribution of the internal parasitic capacitances of the gate to the power consumption 
by charging and discharging each one of them in every cycle; and (2) has symmetric 
intrinsic in- and output capacitances at the differential signals such that it has 
balanced output capacitances. 

In addition to the fact that every cycle the same amount of charge is switched, the 
charge goes through very similar charge and discharge paths during the precharge 
phase and during the evaluation phase respectively. As a result, the gate is subject to 
only minor variations in the input-output delay and in the instantaneous current. This 
is important since the attacker is not so much interested in the total charge per 
switching event, as in the instantaneous current and will sample several times per 
clock cycle in order to capture the instantaneous current. 
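
The difference between a data-dependent and a constant charge per cycle can be made 
concrete with a first-order energy model in Python. This is a simplified sketch and not 
a circuit simulation: the capacitance values are hypothetical, and the supply is assumed 
to deliver C·Vdd² whenever a capacitance C is charged from ground to Vdd. 

```python
VDD = 1.8            # supply voltage in volts (the technology of Sect. 4.1 is 1.8V)
C_OUT = 10e-15       # hypothetical output capacitance in farads
C_INTERNAL = 5e-15   # hypothetical sum of the internal node capacitances in farads


def cmos_energy_per_cycle(outputs):
    """Static CMOS: the supply only delivers energy on a 0 -> 1 output transition."""
    energies = []
    for previous, current in zip(outputs, outputs[1:]):
        rising = previous == 0 and current == 1
        energies.append(C_OUT * VDD ** 2 if rising else 0.0)
    return energies


def ddl_energy_per_cycle(outputs):
    """Idealized SABL-like gate: every cycle one of the two balanced outputs and all
    internal nodes are discharged and recharged, whatever the output value is."""
    return [(C_INTERNAL + C_OUT) * VDD ** 2 for _ in outputs[1:]]


output_sequence = [0, 1, 1, 0, 1]
print(cmos_energy_per_cycle(output_sequence))
print(ddl_energy_per_cycle(output_sequence))
```

In this model the static CMOS energy follows the output sequence, whereas the 
differential gate consumes the same energy in every cycle. The price is that energy is 
also spent in cycles in which the output does not change. 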



4 Design Experiment 

4.1 Description of Experimental Setup 

The goal of the design experiment is to develop design guidelines and to identify 
possible hurdles for securing an encryption module against DPA at the logic level by 
simply removing the foundation of DPA. For this purpose, any encryption algorithm 
could have been chosen. The DES algorithm [5] was chosen because a great part of 
contemporary research focuses on how to perform and how to thwart DPA on the DES 
algorithm. 

In order to obtain supply current traces that are as accurate as possible, simulations 
have been run at the transistor level using HSPICE. Simulating the complete algorithm, 
however, is computationally infeasible, so the algorithm has been stripped down to a 
minimum. 

The experimental setup, which is shown in Fig. 1, is part of the last round of the DES 
algorithm. The module calculates 4 bits of the ciphertext C using a subkey K of length 
6, and 4 and 6 bits of the left and right plaintexts L and R respectively. The substitution 
box is the S1-box. The expansion of the right plaintext R, the permutation of the result 
of the S-box, and the inverse initial permutation, which are present in the actual DES 
algorithm, have been discarded, as they do not change the power measurements: they 
are hardwired. 

The selection function D(C,b,K) consists of calculating bit number b of the left 
plaintext L, using the known ciphertext C and a guess on the secret key K. The right 
plaintext R is also known. In the DES algorithm, R is fed into the inverse initial 
permutation to form part of the ciphertext C. 




Fig. 1. Experimental setup: DPA on a submodule of the last round in the DES-algorithm 

Restricting the experiment to the implementation in Fig. 1 does not simplify the task 
of putting a stop to DPA. On the contrary, in the implementation of the complete DES 
algorithm, the power consumption caused by the calculation of the other bits in the 
same and in the previous rounds will act as an extra and large noise source on the 
power measurements. Note also that in this experiment, all measurements are ‘perfect 
measurements’. Aside from the accuracy of HSPICE, there is no quantization error, 
thermal noise, jitter on the clock of the encryption module, jitter on the sampling 
moment or any other phenomenon that may introduce a measurement error. 

To allow for a comparison, the module has been implemented both in static 
complementary CMOS logic (SC-CMOS), which is the default logic style in a standard 
cell library, and in SABL for a 0.18µm, 1.8V CMOS technology. Simulations have been 
done in HSPICE. In total, the supply current has been captured for 5000 clock cycles 
with a random input at the plaintext registers L and R, and with a fixed secret key K. 
The same random input and the same secret key have been used for both 
implementations. In order to capture all current variation, the sampling frequency has 
been set to 100GHz, which corresponds to one sample every 10ps. Note that this very 
high level of accuracy demands massive simulations. The most time-consuming 
simulation required 275 hours on an HP Visualize B1000 to complete. 



4.2 Effectiveness of the SABL Approach 

In a first setup, the simulations are based on a netlist that does not include effects 
caused by the layout. The parasitic capacitances coming from the intra- and inter-cell 
routing of the data signals have been neglected. Fig. 2 shows the transient and 
statistical properties of the simulated supply current of the SC-CMOS and the SABL 
implementation of the module presented in Fig. 1. 



Fig. 2. Simulated supply current: supply current transient of 4 clock cycles (left) and supply 
current characteristics based on 5000 clock cycles (right) for SC-CMOS (top) and SABL 
(bottom) implementation. Horizontal axes: Time [ns] 

Fig. 2(left) depicts a snapshot of the supply current transient. In total, 4 clock cycles of 
each 4ns are shown. The supply current of the SABL implementation is very regular 
and independent of the input signals, whereas the supply current of the SC-CMOS 
implementation is completely different from cycle to cycle and hence highly depend- 
ent on the input signals. 

Note that the supply current of the SABL module alternates between a short, high 
current peak and a time span with a lower current. These events correspond 
respectively to the precharge phase, in which all gates switch at the same moment, and 
the evaluation phase, in which each gate switches when its inputs arrive from preceding 
gates. The current in the evaluation phase is caused by the pairs of static inverters that 
have been inserted between the SABL gates in order to cascade these dynamic gates 
according to the domino design rules. We preferred the domino design rules to np 
design rules for ease of implementation. The pairs of static inverters, however, add an 
extra penalty on the area and the power consumption. The mean energy consumption 
per clock cycle of the SABL implementation is 11.25pJ compared to 2.70pJ for the 
SC-CMOS implementation. 

Fig. 2(right) depicts the statistical properties of the entire supply current transient. 
Three curves that describe the typical supply current are shown: they represent the 
mean supply current, the absolute variation in the mean supply current and the 
standard deviation on the mean supply current. The curves are generated by first 
folding the supply current of the 5000 clock cycles on top of each other into 1 clock 
cycle to generate an ‘eye’-diagram and then calculating the pointwise mean, absolute 
variation and standard deviation. The curves confirm our observations. The mean 
current of the SABL implementation is a representative switching event for the supply 
current in every clock cycle. The maximum absolute variation and the maximum 
standard deviation are 0.37 mA and 89.5 µA respectively. These values correspond to 
2% and 4.8% of the mean current at their point of occurrence. The SC-CMOS 
implementation, however, experiences a significant variation in the supply current 
from clock cycle to clock cycle. The maximum absolute variation and the maximum 
standard deviation are 3.66 mA and 591.2 µA respectively. These values correspond to 
239% and 38.1% of the mean current at their point of occurrence. Table 1 summarizes 
the numbers. 



Table 1. Simulated supply current: variation in the typical supply current based on 5000 clock 
cycles for SC-CMOS and SABL implementation 

Implementation   max(abs. var.)   ratio to         max(std. dev.)   ratio to 
                 [mA]             mean current*    [µA]             mean current* 
SC-CMOS          3.66             239%             591.2            38.1% 
SABL             0.37             2%               89.5             4.8% 

*At point of occurrence. 
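
The folding step behind these statistics can be sketched in Python. The sketch assumes 
that the ‘absolute variation’ is the difference between the largest and the smallest 
value observed at a given point of the clock cycle, and it uses the population standard 
deviation; neither detail is spelled out in the text. With a 100GHz sampling frequency 
and a 4ns clock cycle there would be 400 samples per cycle; the demonstration below 
uses 4 samples per cycle and made-up current values to stay readable. 

```python
import math


def fold(current, samples_per_cycle):
    """Cut a long supply current record into consecutive clock cycles."""
    n_cycles = len(current) // samples_per_cycle
    return [current[i * samples_per_cycle:(i + 1) * samples_per_cycle]
            for i in range(n_cycles)]


def pointwise_statistics(current, samples_per_cycle):
    """Return pointwise mean, absolute variation (max - min) and standard deviation."""
    cycles = fold(current, samples_per_cycle)
    means, variations, std_devs = [], [], []
    for column in zip(*cycles):  # all values observed at the same point in the cycle
        mean = sum(column) / len(column)
        means.append(mean)
        variations.append(max(column) - min(column))
        std_devs.append(math.sqrt(sum((v - mean) ** 2 for v in column) / len(column)))
    return means, variations, std_devs


# Two made-up clock cycles of four samples each (arbitrary units).
record = [0.0, 2.0, 1.0, 0.0,
          0.0, 4.0, 1.0, 0.0]
means, variations, std_devs = pointwise_statistics(record, samples_per_cycle=4)
print(means, variations, std_devs)
```

The maxima of the last two lists correspond to the max(abs. var.) and max(std. dev.) 
columns of Table 1, and dividing them by the mean at the same index gives the ratios. 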



Fig. 3 shows the Differential Traces that have been generated with 8 different key 
guesses in the selection function. The first bit of plaintext register L has been 
predicted. Note that in total 64 (= 2^6) different guesses of the secret key are possible. 
Only the DT’s of 8 of them are shown for transparency of the figure. The other 56 DT’s 
however, are in accordance with the curves that correspond with the 7 incorrectly 
guessed keys. The correct secret key is 46. 

For the SC-CMOS implementation, the DT of the correct secret key exhibits peaks 
that are significantly higher than the DT’s of the incorrectly guessed keys. All peaks 
can be traced back to precise events. The first peak around 0.5ns corresponds to the 
rising edge of the clock. At this instant of time, the output of the first bit of the 
register L becomes equal to the bit that we predicted with the selection function. The 
second peak around 3ns corresponds to the instant that the input to the first bit of 
register L changes from the bit that we predicted to a new random input. The last peak 
around 3.5ns corresponds to the falling edge of the clock. 

For the SABL implementation, one cannot distinguish which DT is from the correct 
secret key. Moreover, the DT of the correct secret key would not even be considered 
as the DT of a possible correct secret key. Contrary to the SC-CMOS module, an 
analysis of precise events is not possible. 

Fig. 3. Differential Traces based on 5000 clock cycles generated by 8 successive key guesses 
for SC-CMOS (top) and SABL (bottom) implementation. Key 46 is secret key. Please note the 
different scales for SC-CMOS and SABL implementations 

On top of the fact that the DT of the correct secret key does not have any noticeable 
peaks for the SABL module, the DT’s of the SABL module are much smaller than the 
DT’s of the SC-CMOS module. As a result, to determine the DT’s of the SABL im- 
plementation, the test equipment that captures the supply current in the measurement 
setup should have a much better accuracy than is necessary to determine the DT’s of 
the SC-CMOS implementation, which are almost 2 orders of magnitude larger. Fig. 4 
details the DT’s that have been generated with the correct secret key. 

Fig. 4. Differential Traces based on 5000 clock cycles generated by correct secret key for 
SC-CMOS and SABL implementation 

Fig. 5 shows the influence of the data collection on the information content in the 
DT’s. In each plot, the peak-to-peak value (p2p) or the root-mean-square value (rms) 
of the DT’s generated by (1) the correct secret key; (2) an incorrect secret key; and (3) 
a random bit string as selection function, are shown as a function of the number of 
clock cycles used to generate the DT. The random bit string has been used to avoid any 
statistical biases of the S-box output. Note the scale difference on the vertical axis 
between the SC-CMOS implementation and the SABL implementation. 

For the SC-CMOS implementation, a mere 200 clock cycles are enough to disclose the 
correct secret key. For the SABL implementation however, more than 5000 clock 
cycles have been simulated and the correct secret key does not stand out. It is very 
unlikely that increasing the number of clock cycles will make the correct secret key 
stand out. The transient response of the curves has died out and the p2p and the rms 
are in a steady state response; they are set by the section of the DT which corresponds 
to the power consumption of the S1-box. The power consumption of the S1-box is 
uncorrelated with the selection function, as the bits in the left plaintext have no 
influence whatsoever on what happens inside the S1-box. 

One could argue that occasionally the correct secret key also seems to stand out for the 
SABL implementation. Fig. 5 however, shows the p2p and rms of only one incorrect 
secret key. There are DT’s of other incorrect secret keys for which the p2p or rms are 
comparable and/or higher than for the correct secret key. 

Fig. 5. Influence of data collection: peak-to-peak value (left) and root-mean-square value (right) 
of the Differential Traces generated by the correct secret key, an incorrect secret key and a 
random bit string as selection function for SC-CMOS (top) and SABL (bottom) 
implementation. Please note the different scales for SC-CMOS and SABL implementations 
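
The two figures of merit used in Fig. 5 can be computed as in the Python sketch below. 
The function names and the tiny data set are hypothetical and only illustrate how the 
p2p and rms values of a DT evolve with the number of clock cycles included. Every 
prefix of the data must contain at least one trace in each of the two sets. 

```python
import math


def peak_to_peak(dt):
    """Peak-to-peak value (p2p) of a Differential Trace."""
    return max(dt) - min(dt)


def root_mean_square(dt):
    """Root-mean-square value (rms) of a Differential Trace."""
    return math.sqrt(sum(v * v for v in dt) / len(dt))


def mean_trace(group):
    """Pointwise mean of a non-empty list of equally long traces."""
    return [sum(column) / len(group) for column in zip(*group)]


def differential_trace(traces, selection_bits):
    """Difference between the mean traces of the sets selected by 1 and by 0."""
    ones = [t for t, d in zip(traces, selection_bits) if d == 1]
    zeros = [t for t, d in zip(traces, selection_bits) if d == 0]
    return [a - b for a, b in zip(mean_trace(ones), mean_trace(zeros))]


def metrics_versus_cycles(traces, selection_bits, checkpoints):
    """p2p and rms of the DT built from the first n clock cycles, for each n."""
    results = []
    for n in checkpoints:
        dt = differential_trace(traces[:n], selection_bits[:n])
        results.append((n, peak_to_peak(dt), root_mean_square(dt)))
    return results


# Made-up example: the middle sample is 1 whenever the selection bit is 1.
traces = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
selection_bits = [1, 0, 1, 0]
results = metrics_versus_cycles(traces, selection_bits, checkpoints=[2, 4])
print(results)
```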



4.3 Effects of Layout Parasitics 

The SABL approach has been shown to be an effective remedy against DPA when the 
layout parasitics are not taken into account. In the next setup, the simulations are 
based on a netlist that does account for the effects of the layout. First, a cell library has 
been created that contains all cells used in the module. Then, these cells have been 
used to place and route the module. The complete layout in SABL is shown in Fig. 6. 
The parasitic capacitances from the intra- and inter-cell routing will not only result in 
a performance degradation, in particular in an increase of the input-output delay and of 
the power consumption, but they will also result in variations in the total charge that is 
used per switching event if both differential output signals do not see the same 
parasitic capacitances. 

Special attention has been given to the layout of each cell in an effort to balance its 
intrinsic in- and output capacitances. The inter-cell routing has been addressed by 
routing the differential lines in the same environment. This assures that the parasitic 
capacitances to other metal layers are comparable at both interconnects. Further, the 
cross coupling between long adjacent lines in the same layer has been addressed with 
shielding. The shielding has a tradeoff with an increase in power consumption and in 
area. 

Fig. 6. Layout of SABL implementation of module presented in Fig. 1 

Fig. 7, which depicts the transient and statistical properties of the simulated supply 
current, is in accordance with Fig. 2(bottom). The snapshot of the supply current 
transient remains very regular and independent of the input signals. Compared with 
the case that the layout was not taken into account, however, there is a penalty of 
approximately 100% in the input-output delay and in the power consumption. The 
mean current of the SABL implementation remains likewise a representative switching 
event for the supply current in every clock cycle. The maximum absolute variation and 
the maximum standard deviation are 0.26 mA and 65.8 µA respectively. These values 
correspond to 13% and 2% of the mean current at their point of occurrence. Note that 
in spite of the increase in power consumption, the absolute figures have decreased 
by approximately 30% compared with the case that the layout was not taken into 
account. The relative figure of the maximum absolute variation however, has 
increased by a factor of 5. The relative figure of the maximum standard deviation on 
the other hand, has decreased by a factor of 2.5. 

Even though there does not seem to be a significant difference in the supply current 
characteristics between the module before layout and the one after the layout phase, 
the DT of the correct secret key exhibits 2 peaks that are higher than the DT’s of the 
incorrectly guessed secret keys, as can be seen in Fig. 8. The first peak around 0.5ns 
corresponds to the rising edge of the clock. At this instant of time, the output of the 
first bit of the register L changes state. The peak has a value of 10.28 µA, which is a 
factor of 12.3 smaller than the peak at the rising edge of the SC-CMOS 
implementation. The latter implementation however, did not include the layout 
parasitics. Including the layout parasitics into the SC-CMOS implementation will 
increase this number. The second peak at 7.5ns corresponds to the falling edge of the 
clock. At this instant, the output of the XOR is read into C. 

Fig. 7. Simulated supply current: supply current transient of 4 clock cycles (left) and supply 
current characteristics based on 5000 clock cycles (right) 

Fig. 8. Differential Traces based on 5000 clock cycles generated by 8 successive key guesses. 
Key 46 is secret key 



5 Conclusions 

We have presented a technique to thwart DPA that uses a logic style with 
data-independent power consumption. The technique achieves perfect security 
whenever the layout parasitics are neglected. In our simulation setup, the secret key 
has not been exposed, and increasing the data collection is very unlikely to help. With 
parasitics, DPA is possible. Our simulations, however, show that the resulting DT’s are 
more than an order of magnitude smaller than those of an SC-CMOS implementation. 
Furthermore, in our opinion, improvements are still possible. The resulting increased 
security will as always come in a tradeoff with some cost. Here, the cost will be an 
increase in power and area for more aggressive shielding, and an increase in area and 
initial design time for a perfectly symmetric standard cell. It is still unclear, however, 
whether a DPA on an actual product will reveal the secret key or not. The measurement 
setup will suffer from measurement errors, a coarser resolution in the time domain, 
supply current filtering caused by decoupling, supply parasitics and additional large 
supply current noise coming from other modules, which for the non-sensitive parts 
will have the huge supply current variations of SC-CMOS. 

In Table 1, we have also presented the minimum variation that seems achievable for 
any technique at the logic level or higher levels that tries to balance the instantaneous 
power consumption of a module implementing a logic function. Any actual implementation 
will suffer from larger variation coming not only from asymmetrical intra- 
and inter-cell routing but also from technology and process variations, over which 
absolutely no control is possible. 

Abstract. Balanced asynchronous circuits have been touted as a superior 
replacement for conventional synchronous circuits. To assess these 
claims, we have designed, manufactured and tested an experimental 
asynchronous smart-card style device. In this paper we describe the 
tests performed and show that asynchronous circuits can provide better 
tamper-resistance. However, we have also discovered weaknesses with 
our test chip, some of which have resulted in new designs, and others 
which are more fundamental to the asynchronous design approach. This 
has led us to investigate the novel approach of design-time security 
analysis rather than rely on post manufacture analysis. 

Keywords. Asynchronous circuits, Dual-Rail encoding, Power Analysis, 
EMA, Fault Analysis, Design-time security evaluation 



1 Introduction 

The widespread use of processors in security applications, e.g. in smart cards 
or Hardware Security Modules (HSMs), has increased both the financial and 
social benefits that hackers would gain by tampering with such systems. During 
the past seven years, there has been extensive research to enhance the security of 
such systems. Most of the countermeasures developed were software-based, protecting 
mainly against side-channel information leakage. The performance and 
cost penalties resulting from such countermeasures were affordable. However, 
software protection against more recently publicised classes of attacks, like those 
involving fault injection, consumes considerable memory. 

There is an urgent need to put more focus on the hardware side of the 
system. One attractive path is the use of self-timed or asynchronous circuits. 
In this respect, we have designed, manufactured and tested an experimental 
asynchronous smart-card style device. In this paper, we present the principal 
results of the security analysis of a secure asynchronous processor. We highlight 
the advantages brought by the self-timed nature of the circuit. We also analyse 
some of the weaknesses that we spotted. Hence, we not only present one of the 
industry’s first thorough stress tests of a clockless circuit but also propose 
an evaluation procedure for post-manufacture analysis. Finally we introduce a 
concept whereby those flaws could have been identified at design level, through 
thorough simulation, leading to what we call design-time security analysis. 

We finish this introduction by providing motivations for using asynchronous 
circuits in this field, and give a brief overview of the Springbank test chip (see 
also [1]) and the experimental set-up used. In Section 2 we provide results for 
DPA and EMA before describing, in Section 3, the chip’s resistance to optical 
probing and power glitches. Finally, in Section 4, we give an insight into our first 
results on design-time analysis. 

1.1 Motivation for Using Asynchronous Circuits 

Speed independent (SI) asynchronous circuits are expected to offer a number 
of advantages over their synchronous counterparts when designing secure sys- 
tems [1,2]: 

Environment tolerance — SI circuits adapt to their environment which 
means that they should tolerate many forms of fault injection (power glitches, 
thermal gradients, etc). This makes fault sensing easier since only major 
faults need to be detected and reacted to. This is desirable since minor fluc- 
tuations in environment conditions are normal during real-world operation. 
Redundant data encoding — SI circuits typically use a redundant encoding 
scheme (e.g. dual-rail). In the latter, each bit is encoded onto two wires A0 
& A1 as shown in the table below. This mechanism also provides a means 
to encode an alarm signal (e.g. use 11 = alarm in a dual-rail scheme [1]). 



A1   A0   meaning 
0    0    clear 
0    1    logical 0 
1    0    logical 1 
1    1    alarm 



Balanced power consumption — Circuits comprising dual-rail (or multi- 
rail) codes can be balanced to reduce data dependent emissions. In the above 
illustration whether we have a logical-0 or a logical-1, the encoding of the bit 
ensures that the data is transmitted and computations are performed with 
constant Hamming weight. This is important since side-channel analysis is 
based on the leakage of the Hamming weight of the sensitive data. 
Fine-grained random timing variation — may be used to make correlation 
of repeated runs more difficult, thereby making signal averaging problematic. 
Absence of a clock signal — no clock means that clock glitch attacks are 
removed. 
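
The dual-rail code in the table above can be illustrated with a minimal sketch. The function names and the 16-bit word width are assumptions made for this example and are not taken from the Springbank design; the sketch only shows that every data word drives the same number of wires high, and that the clear and alarm states are not valid data.

```python
# Dual-rail encoding sketch: each bit is carried on two wires (A1, A0).
# Code words follow the table in the text; all names here are illustrative.

CLEAR, LOGIC0, LOGIC1, ALARM = (0, 0), (0, 1), (1, 0), (1, 1)


def encode_bit(bit):
    """Return the (A1, A0) pair for a logical bit."""
    return LOGIC1 if bit else LOGIC0


def decode_pair(pair):
    """Map an (A1, A0) pair back to 0 or 1; reject clear and alarm states."""
    if pair == LOGIC0:
        return 0
    if pair == LOGIC1:
        return 1
    raise ValueError("not a data code word: %r" % (pair,))


def encode_word(value, width=16):
    """Encode an integer as dual-rail pairs, least significant bit first."""
    return [encode_bit((value >> i) & 1) for i in range(width)]


def wires_high(pairs):
    """Count the wires at logic 1 across all pairs (Hamming weight on the bus)."""
    return sum(a1 + a0 for a1, a0 in pairs)


if __name__ == "__main__":
    for value in (0x0000, 0x0011, 0xFFFF):
        # Every valid pair has exactly one wire high, so the count equals the width.
        print(hex(value), wires_high(encode_word(value)))
```

In this idealised model the count is always 16 for a 16-bit word, whatever the data. It says nothing about wire lengths or gate capacitances, which is where the imbalances discussed in Section 2 arise.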

Whilst dual-rail coding might be used in a clocked environment, one would 
have to ensure that combinational circuits were balanced and glitch free. Return-to-zero 
(RTZ) signalling is also required to ensure data-independent power emissions. 
Once you have gone to these lengths, it is just a small step to an SI 
asynchronous implementation which offers the additional benefit of better environment 
tolerance, i.e. tolerance to fault injection. 




Fig. 1. Springbank Test chip 



1.2 Overview of the Springbank Test Chip 

The Springbank chip was fabricated in the UMC 0.18 µm six-metal CMOS process. 
It contains five 16-bit microcontroller processors, various I/O interfaces and 
other units as part of other projects. All five processors are based on the same 
16-bit XAP architecture but with different implementations. The processors are: 
one synchronous XAP (S-XAP), a bundled data XAP (BD-XAP), a 1-of-4 XAP 
(OF-XAP), a 1-of-2 (dual-rail) XAP (DR-XAP) and a secure variant of the dual-rail 
XAP (SC-XAP). Given that all processors lie on the same chip and that 
we used the same standard cell library, comparisons do not need to take into 
account technology or foundry variations. A 1-of-4 distributed interconnect interfaces 
these processors to a standard single-rail SRAM holding program and 
data. In addition, communication between the SC-XAP and SRAM is done via 
a memory protection unit (MPU) with bus encryption; these were disabled in 
the experiments described in this paper. 

Figure 1 shows a picture of the test chip. The SC-XAP is approximately twice 
the area of the synchronous XAP. However, the commercial standard cell library 
used was optimised for synchronous design and not asynchronous design. An 
optimised library might reduce this area penalty to a factor of 1.5. Furthermore, 
one must remember that the clocked system requires a clock generator. Clock 
multipliers (PLLs), for example, can take up considerable space. 

1.3 Experimental Set-Up 

The aim behind these tests is to quantify the gain in security when moving from a 
conventional clocked design (as implemented on the S-XAP) to an asynchronous 
dual-rail environment (an example of which is the SC-XAP, which bears all the 
features described in Section 1.1). For this reason, tests were mainly carried out on 
the S-XAP and the SC-XAP. Since we were in a ‘characterisation’ phase, our aim 
was not to break any cryptographic algorithm. We targeted simple instructions 
which gave a good indication of how the hardware reacts to the various tests 
performed. The tests were made on the execution of a simple XOR sequence 
whereby we: 

— load the memory address at which data are found, 

— load a first operand (Op1) into a register, 

— perform an XOR between Op1 and the second operand (Op2), 

— store result (Res) back to memory. 

To monitor the above execution, after each run of the sequence 
the three values, i.e. Op1, Op2 and Res, were retrieved via the UART port and 
displayed on a monitor (or stored in a file). The tests were carried out in 
a white-box configuration, without any encryption mechanism activated. This 
allowed us to thoroughly analyse the benefits and weaknesses of asynchronism. 

In the next sections, we describe the results obtained for each family of 
tests. The interpretation and explanation for those observations are then detailed 
accordingly. 



2 Side Channel Analysis 

In this section, we look at the information leakage through two forms of side- 
channels: one is by studying the power consumed by the processor (Differential 
Power Analysis - DPA [7]) and the other is by observing the Electro-Magnetic 
(EM) waves emitted by the processor (Electro-Magnetic Analysis - EMA [8,9]). 



2.1 Differential Power Analysis 

Power dissipation in static CMOS circuits is dominated by switching activity. 
As a result, the power dissipated is highly dependent on the switching activity 
produced by a change of input data. In the simple case of a bus, activity is 
observed as the Hamming weight of the state changes. Data-dependent power 
leakage may be exploited to reveal useful information, either by analysing single 
power traces (Simple Power Analysis) or by collecting many power traces and 
performing a statistical analysis of the power variation with respect to changes 
in data values (Differential Power Analysis [7]). 
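
As a rough illustration of the bus model mentioned above, the switching activity of a single-rail bus can be approximated by counting the bits that change between two consecutive values. This is a toy model with hypothetical function names; it ignores capacitance and timing.

```python
def hamming_weight(x):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def bus_transitions(previous, new):
    """Number of bus lines that toggle when the bus goes from previous to new."""
    return hamming_weight(previous ^ new)


if __name__ == "__main__":
    # Two different data changes on the same bus toggle different numbers of lines.
    print(bus_transitions(0x0000, 0x0011))
    print(bus_transitions(0x0000, 0x00FF))
```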

DPA Attacks on the Springbank Chip. Power analysis of the secure dual- 
rail processor revealed that small imbalances in the design of the dual-rail gates 
allowed some data-dependent power leakage to be observed. The XOR opera- 
tion provides one example of where data-dependent power consumption may be 
observed. Power traces were collected experimentally for two different XOR operations. 
The operands for the first XOR instruction were 0x11 and 
0x22; these were changed to 0x33 and 0x55 for the second. 

Figure 2 shows the results of collecting power traces for each operation, av- 
eraging the traces over 4000 runs, and then subtracting one averaged trace from 
the other. The centre curve represents this difference. The small disturbance, left 
of centre, is the result of data-dependent differences in the power requirements 
for the two XOR operations. 
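
The averaging and subtraction step described above can be sketched as follows. The traces here are synthetic (Gaussian noise plus a small offset at one sample), and the trace length, noise level and leak size are assumptions chosen only for illustration; they are not measurements from the chip.

```python
import random


def average_trace(traces):
    """Average a list of equal-length traces sample by sample."""
    count = len(traces)
    return [sum(samples) / count for samples in zip(*traces)]


def difference_trace(traces_a, traces_b):
    """Subtract the averaged trace of operation B from that of operation A."""
    mean_a = average_trace(traces_a)
    mean_b = average_trace(traces_b)
    return [a - b for a, b in zip(mean_a, mean_b)]


def synthetic_trace(rng, length, leak_at, leak):
    """Noise-only trace with a data-dependent offset added at one sample (assumed model)."""
    trace = [rng.gauss(0.0, 1.0) for _ in range(length)]
    trace[leak_at] += leak
    return trace


if __name__ == "__main__":
    rng = random.Random(1)
    runs, length, leak_at = 4000, 50, 20
    op_a = [synthetic_trace(rng, length, leak_at, 0.5) for _ in range(runs)]
    op_b = [synthetic_trace(rng, length, leak_at, 0.0) for _ in range(runs)]
    diff = difference_trace(op_a, op_b)
    peak = max(range(length), key=lambda i: abs(diff[i]))
    print("largest difference at sample", peak)
```

Averaging over many runs reduces the uncorrelated noise, so a small data-dependent offset that is invisible in a single trace can stand out in the difference of the averages.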

The same kind of analysis was carried out on the S-XAP and similar data-dependent 
information leakage was observed. However, the extent of the leakage 
was more significant in the case of the clocked XAP than in the 
asynchronous one. More detailed measurements showed that the data-dependent 
information leakage of the SC-XAP was lower than that of the S-XAP by 
about 22 dB. This reduction is not sufficient to completely protect against DPA. 
However, in other cases, we have seen that a reduction in information leakage 
by 20-24 dB could neutralize leakage with respect to SPA. 
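
To put the decibel figures in perspective, the following sketch converts a reduction in dB into a linear ratio. The text does not state whether the 22 dB refers to signal amplitude or to signal power, so both standard conversions are shown; the function names are illustrative.

```python
def db_to_amplitude_ratio(db):
    """Linear ratio if the dB value describes an amplitude (voltage or current)."""
    return 10 ** (db / 20)


def db_to_power_ratio(db):
    """Linear ratio if the dB value describes a power."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    for db in (20, 22, 24):
        print(db, round(db_to_amplitude_ratio(db), 1), round(db_to_power_ratio(db), 1))
```

Under the amplitude reading, 22 dB corresponds to a leakage signal roughly 12 to 13 times smaller; under the power reading, to a signal power roughly 160 times smaller.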

Further reducing the data-dependent power leakage. This example is 
a good illustration of the difficulty of designing secure processors. On paper, 
the SC-XAP seemed breachless thanks to its dual-rail with RTZ implementation. 
However, when it came down to implementing this scheme, conventional 
place & route tools were used. Those tools tend to optimise space, which means 
that if a bit is encoded onto two ‘wires’, one wire might end up being longer 
than the other, creating an imbalance which could produce power leakage. So 
a first improvement would be to either have a full-custom design or develop a 
place and route tool which understood how to balance signal paths. Further improvements 
could be made at a transistor level: current standard cell libraries 
typically optimise the transistor sizing of gates to minimise the delay through 
the gate rather than ensuring that the capacitance across all inputs is identical. 

Fig. 2. Differential Power Analysis on Secure XAP (experimental graph) 

J.J.A. Fournier et al. 



2.2 Electro-Magnetic Analysis 

In this case, the tests performed were similar to the ones for the DPA, but 
this time, for each XOR execution, we measured the Electro-Magnetic (EM) 
waves emitted by the active processor (asynchronous or clocked) [8]. For 
the SC-XAP, the EM signals collected were of exploitable magnitudes, which 
allowed successful DPA-like treatments to be carried out on the collected reference 
curves. For both the SC-XAP and the S-XAP, data-dependent ‘signatures’ were 
obtained at three places: at the load of Op1, at the XOR execution and at the 
write-back of the XOR result into memory. This is illustrated in Figure 3. 

The EMA results were taken without signal averaging since a signal was 
clearly visible above the noise. In Figure 3, the uppermost curve is an example 
of the EM signals measured and the lower three correspond to DEMA curves 
obtained by performing, on the EM curves, differential analysis [8] with respect 
to Op1, Op2 and Res. The ‘peaks’ shown must be interpreted as leakage of the 
data’s Hamming weight. We also clearly see at what instants the three different data values 
are ‘manipulated’. For the results presented here, a coil covering the processor 
was used, which means that we were capturing the ‘global’ EM waves of the 
entire SC-XAP. However, one could envisage using smaller coils (e.g. 40 µm) to 
measure emissions from a smaller area [8] and target the exact region where the 
data is being manipulated (the data bus or the ALU, for example). 
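
A differential analysis with respect to a known data value can be sketched as a partition of the collected curves followed by a difference of means. The selection function used here (one chosen bit of the value) is a common choice and an assumption of this example; the text does not specify which selection function was used for the DEMA curves.

```python
def mean_trace(traces):
    """Average a list of equal-length traces sample by sample."""
    return [sum(column) / len(traces) for column in zip(*traces)]


def differential_trace(traces, values, bit):
    """Difference of mean traces, partitioned on one bit of the known data value.

    traces: list of measured curves, one per execution
    values: the known data value (e.g. first operand, second operand or result) per execution
    bit:    index of the bit used as the selection function (assumed choice)
    """
    ones = [t for t, v in zip(traces, values) if ((v >> bit) & 1) == 1]
    zeros = [t for t, v in zip(traces, values) if ((v >> bit) & 1) == 0]
    if not ones or not zeros:
        raise ValueError("both partitions must be non-empty")
    return [a - b for a, b in zip(mean_trace(ones), mean_trace(zeros))]


if __name__ == "__main__":
    # Toy data: the second sample is higher whenever bit 0 of the value is set.
    traces = [[1.0, 5.0], [1.0, 1.0], [1.0, 5.0], [1.0, 1.0]]
    values = [1, 0, 3, 2]
    print(differential_trace(traces, values, bit=0))
```

Running the same partition against each of the three data values in turn would give three differential curves, with peaks expected where the corresponding value is handled.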




Fig. 3. DEMA Results 

EMA leakage on the SC-XAP. As for DPA, balanced logic was used as 
a countermeasure for EMA. And as in Section 2.1, the imbalance introduced 
by the design tools used has been lethal. Moreover, with EMA, we did not 
observe the same 22 dB reduction in the amount of information leaked because 
EMA is able to isolate much finer circuit areas and hence the placement and 
routing of components becomes far more critical in achieving a balanced design. 
In addition to this, the absence of the clock in asynchronous circuits eases EMA. 
In conventional clocked circuits, the clock usually adds noisy components to the 
EMA signals captured whereas in asynchronous circuits, we no longer face that 
inconvenience. 

To make the EMA measurements more difficult, a top level metal defence 
grid may be used. These are seen on modern smart cards and if suitable signals 
are injected into them they can help mask the underlying activity. 

Where operations may be performed in more than one place (e.g. if using a 
dual execution pipeline), non-determinism may be used to make data collection 
more difficult. Security evaluation of this approach is most tractable when the 
attacker is known to have limited resources, for example one EMA trace taken 
from one sensor. However, when multiple runs and multiple sensors are used 
the evaluation is far more complex and is dependent on the algorithm being 
executed. We are also investigating geometrically regular structures (e.g. PLAs) 
to determine if this approach to design is more secure than a conventional ASIC 
design flow. 

3 Fault Injection Analysis 

A second class of stress-testing techniques consists of injecting faults into the 
device in order to obtain exploitable ‘abnormal’ behaviours. Injecting faults into 
working processors can change the nature of data being treated or corrupt cryp- 
tographic computations in such a way as to unveil secret information [6]. 

Early forms of these so-called active attacks were focused on the device’s 
external interface and often involved introducing glitches on power or clock in- 
put pins [10]. Changes in temperature, either by cooling or heating the whole 
device or the introduction of a temperature gradient, may also be used to induce 
faulty behaviours. Defences against such attacks are simplified by the restricted 
nature of the channel by which faults are injected and can easily be detected by 
incorporating a suitable tamper sensor. Far greater control over the nature of 
the faults injected has been demonstrated recently. 

Two approaches were taken to inject faults into the Springbank: the first one 
was by optical probing and the second one was by injecting power glitches. As 
for the previous side-channel analysis, we targeted an XOR operation. This time 
we worked only on the SC-XAP. 

3.1 Optical Probing Techniques 

Laser radiation with a sufficiently short wavelength (i.e. sufficiently high photon energy) 
and intensity may be used to ionise semiconductor materials. When ionisation occurs in 
a depletion region, the production of additional carriers and the presence of an 
electric field (built-in field and any reverse bias) cause a current to flow. This 
photocurrent is capable of switching other transistors whose gates are connected 
to the illuminated junction. This process is a transient one where normal circuit 
activity resumes once the light source is removed. 

In addition to what may be considered a useful attack mechanism, negative 
effects are also possible. These include the possibility that latch-up may be in- 
duced by the generation of photocurrents in the bulk. Of less concern, when 
using readily available infra-red and visible laser light sources, is the ionisation 
of gate- and field-oxides due to the large band gap energy of silicon dioxide 
(which would require a laser with a wavelength in the UV-C range). Ionisation 
of this type is common when higher energy forms of radiation are absorbed. 
The subsequent accumulation of positive charges results in a long term shift in 
transistor characteristics. The following sections explore the weaknesses of the 
dual-rail technology employed in the Springbank test chip. We then introduce a 
number of improvements that could secure the design against such attacks. 

Optical Probing Attacks on the Springbank Chip. If the dual-rail imple- 
mentation had provided a completely fault-secure design all attempts at inducing 
faults would have resulted in deadlock. In many cases the processor did propa- 
gate an error signal resulting in deadlock. Unfortunately, two weaknesses in the 
current design were revealed by the experiments. 

The first involved the injection of faults into the ALU design. By targeting 
two different regions within the ALU two different fault behaviours were possible. 
The first was to disrupt the ALU operation to produce an incorrect result, the 
second forced the ALU to always return the result 0x0001. These results are 
possible as some of the dual-rail gates within the ALU do not guarantee that 
the presence of the error state on their inputs (in this case a logic-1 on both 
dual-rail wires) is propagated. This was a known and unfortunate concession 
made at the design stage. 

Perhaps more interesting is the second failure behaviour. In this case it was 
possible to set the contents of the processor’s registers. The exposure of a single 
register cell to laser light reliably resulted in setting its value to a logic-1. The 
dual-rail register design that was used in the Springbank chip is illustrated in 
Figure 4. Setting the cell to ‘1’ was made possible by its inability to store an 
error state (both states of the single flip-flop are valid). The precise mechanism 
by which the register was set first involved both the outputs of the NOR gates 
being pulled low. This happened as a result of the laser producing photocurrents 
in the junctions of the N-type transistors in both gates. When the laser was 
removed, the flip-flop resolved into the logic-1 state due to differences in the 
threshold values for each gate. N-type transistors in general produce much larger 
photocurrents due in part to the superior mobility of electrons when compared 
to holes. (The electrons and holes in this case are the minority carriers on the 
larger side of the depletion region. The depletion region extends mostly into the 
region of least doping.) It is important to note that the attack was successful 
even with a large spot size exposing many transistors (we estimate around 100). 

Optical Probing Countermeasures. The vulnerability of the ALU design 
may be countered by ensuring that an error state on any gate input is always 
propagated to its output. An approach to providing a secure dual-rail register 
design is shown in Figure 5. Here the number of flip-flops has been doubled. The 
four possible states are now split into two valid states (representing zero-symbol 
and one-symbol) and two error states (null encoded as 00 and error as 11). When 
the register is reset it is forced into the error state; this prevents the possibility 
that the reset signal may be used as a simple way to reset the contents of a 
register to a valid state (perhaps by targeting a reset signal buffer). The error 
state will only be propagated on the register’s output if the register is read. For 
correct operation, the register must be written with a valid data value prior to 
reading. The register is also designed to produce an error signal if a ‘null’ state is 
ever stored. The ability to store a null value may assist an attacker by allowing 
them to inject an actual data value from another source. 

We will now consider a number of different attack scenarios and how the 
design is able to detect the injected faults. 

Fig. 5. Dual Flip-Flop Dual-Rail Register 

Initially we consider an attack similar to the one described above. Here a 
large number of transistors are exposed and force all the gate outputs to a logic- 
0. By guaranteeing that both flip-flops resolve in the same direction, the resulting 
state of the register will always be one that represents a fault (null or error). 
Attempts to target and modify a single flip-flop will again always result in a fault 
state. A successful attack now requires greater control over the fault injection 
process. For example, if both flip-flops could be exposed and then the laser spot 
moved up away from the lowermost NOR gate, a valid data value could be written. 
Independent and simultaneous control over individual transistors also offers the 
possibility of setting registers to particular values. 
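
The behaviour of the dual flip-flop register under the attack scenarios above can be sketched with a small state model. This is a behavioural abstraction only: the class and method names are hypothetical, the assignment of 01 and 10 to the zero-symbol and one-symbol is an assumption, and faults are checked on read rather than at the moment they are stored.

```python
class AlarmError(Exception):
    """Raised when a read finds the register in a fault state."""


class DualFlipFlopRegister:
    """Behavioural model of a one-bit register built from two flip-flops."""

    NULL = (0, 0)
    ZERO = (0, 1)   # assumed code word for the zero-symbol
    ONE = (1, 0)    # assumed code word for the one-symbol
    ERROR = (1, 1)

    def __init__(self):
        # Reset forces the error state, so reset cannot be used to obtain valid data.
        self.state = self.ERROR

    def reset(self):
        self.state = self.ERROR

    def write(self, bit):
        self.state = self.ONE if bit else self.ZERO

    def read(self):
        if self.state in (self.NULL, self.ERROR):
            raise AlarmError("fault state %r stored" % (self.state,))
        return 1 if self.state == self.ONE else 0

    def force_both(self, level):
        """Fault model: wide illumination, both flip-flops resolve the same way."""
        self.state = (level, level)

    def flip_one(self, index):
        """Fault model: a single flip-flop is toggled."""
        bits = list(self.state)
        bits[index] ^= 1
        self.state = tuple(bits)


if __name__ == "__main__":
    attacks = {
        "both forced low": lambda r: r.force_both(0),
        "both forced high": lambda r: r.force_both(1),
        "first flip-flop toggled": lambda r: r.flip_one(0),
        "second flip-flop toggled": lambda r: r.flip_one(1),
    }
    for name, attack in attacks.items():
        reg = DualFlipFlopRegister()
        reg.write(1)
        attack(reg)
        try:
            reg.read()
            print(name, "-> undetected")
        except AlarmError:
            print(name, "-> alarm")
```

In this model every single-flip-flop or common-mode fault on a valid value ends in a null or error state. Writing a chosen valid value requires changing both flip-flops in opposite directions, which matches the greater control the text says an attacker would need.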

Security may be further improved by including small optical tamper sensors 
within each standard cell. These sensors, constructed from one or two transistors, 
would play no part in normal circuit behaviour (only adding a small 
amount of capacitance). Their only function is to force the dual-rail outputs 
of the gate into an error state when illuminated. A similar approach is already 
taken in many standard cell libraries to protect against plasma-induced oxide 
damage during manufacture; in this case an antenna diode is added on every gate 
input. These ideas together with security-driven place-and-route would again 
increase the level of controllability required to perform a successful modification 
of register values. 



3.2 Power Glitch Attacks 

Power glitches may be used to inject faults at a coarse level. Tests on the SC-XAP 
revealed that it was resistant to short Vcc glitches which went down to the 
ground rail and back. For longer-duration glitches we observed faulty processor 
behaviours which could constitute favourable conditions for the cryptanalysis 
of cryptographic algorithms like DES or RSA [5,6]. By injecting the power 
glitch at different times, we succeeded in causing specific parts of our small 
program to malfunction. The interesting thing to note is that if we want to 
target, say, the load of Op1, we synchronize our program so as to 
‘cut’ the power just at that instant and resume it several tens of nanoseconds 
later in such a way that normal program execution resumes. The effects of 
the glitch are monitored through the power consumed by the processor, just as 
for the Differential Power Analysis. This is illustrated in Figure 6, which is a 
superposition of two power traces: one in the normal mode and one where we 
introduced the power glitch. 

If we synchronize the curves and zoom in as shown in Figure 7, we see that 
the real impact is on the execution of the load of Op1. In this case, the value read 
as Op1 was always 0xFFFF. Consequently, the result of the XOR operation 
was always the logical inverse of Op2. In this case, we have targeted one precise 
instruction and corrupted an entire data word. This scenario could be lethal if we were 
to attack the load of a DES key, for example. 
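
The effect of this fault on the result is easy to check: when the first operand is forced to all ones, a 16-bit XOR returns the bitwise complement of the second operand. The helper names below are illustrative.

```python
MASK16 = 0xFFFF  # the XAP is a 16-bit architecture


def xor16(a, b):
    """16-bit XOR as executed by the test program."""
    return (a ^ b) & MASK16


def glitched_xor(op2):
    """XOR result when the glitch makes the first operand read as 0xFFFF."""
    return xor16(0xFFFF, op2)


if __name__ == "__main__":
    for op2 in (0x0022, 0x0055, 0x1234):
        assert glitched_xor(op2) == (~op2) & MASK16
        print(hex(op2), "->", hex(glitched_xor(op2)))
```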

In another experiment, we generated the glitch while uploading the data’s 
address into the address register. This led to the ‘writing’ of an erroneous address. 

Fig. 6. Vcc Glitch on Secure XAP 

These are a few examples of how ‘long’ glitches can corrupt the functioning 
of a self-timed system. We are currently looking into this aspect. 

4 Design-Time Security Analysis 

We have seen that in many ways, the SC-XAP exhibits several interesting security 
properties like lower data-dependent power leakage and resistance to optical 
probing for most of the processor. We did not predict that the asynchronous nature 
of the circuit could facilitate attacks like EMA and voltage glitches. Moreover, 
other security flaws, like the low resistance of the ALU and the register bank 
to optical probing, are linked to unfortunate design trade-offs in the SC-XAP. 




The design and evaluation of the Springbank test chip is typical of the de- 
sign process for secure processors. We began with a requirements specification 
which included security properties. This allowed us to identify key design crite- 
ria which steered the design process. However, we lacked design time validation 
of the security criteria and we now know that some side cases were overlooked. 
Even more worryingly, our colleagues working on attack technologies developed 
new attacks which we had not even considered during the design process. What 
we seem to have recreated in our research project was a microcosm of current 
industrial practice. 

Dissatisfied with ad hoc evaluation post design, we have begun a research 
programme to investigate design-time security validation techniques. In the last 
section of this paper, we give an insight into the ongoing work on design-time 
analysis, which is bound to become important for the future design of secure 
processors. Design-time analysis is performed during the design process, whereby 
we try to simulate the behaviour of the processor along with the current 
consumed and the energy radiated. 

4.1 Simulating Side-Channel Information Leakage 

To confirm the source of the imbalance observed during the side-channel analysis, 
we simulated the operation of the ALU executing the same XOR operations 
described in Section 2.1. The power simulation results were collected using 
PrimePower™ [4], a gate-level power estimation tool. The power estimation 
includes capacitance and resistance values extracted from layout. The results of 
the simulation are shown in Figure 8. Even in the absence of a power model 
for the memory system we observe a similar data-dependent power difference 
during the execution of the XOR operations. Using a simple second-order low-pass 
filter model for the power distribution network provides more comparable 
data (Figure 9). The power dissipation curves for the XOR operations differ in 
shape from those measured as they include no power for memory accesses. This 
produces a significant drop in power at the point where one XOR operand is 
fetched from memory. 
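
The low-pass filtering step can be sketched as below. The filter actually used is not described in the text; this example assumes a second-order filter built from two cascaded first-order exponential smoothers, and the coefficient alpha is a hypothetical parameter.

```python
def first_order_lowpass(samples, alpha):
    """Exponential smoothing: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out = []
    y = samples[0] if samples else 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out


def second_order_lowpass(samples, alpha):
    """Two cascaded first-order stages (an assumed stand-in for the supply network)."""
    return first_order_lowpass(first_order_lowpass(samples, alpha), alpha)


if __name__ == "__main__":
    # A sharp current spike is spread out and attenuated by the filter.
    simulated_current = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    print(second_order_lowpass(simulated_current, alpha=0.5))
```

Applying the same filter to the simulated traces of both XOR operations before subtracting them mimics the smoothing that decoupling and supply parasitics impose on a real measurement.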

We hence see how data-dependent leakage may be detected at design time via 
systematic simulation. Such simulations allow design comparisons to be made, 
though it is harder to predict the exact values of emissions. The simulations 
we have undertaken for power are based upon switching activity. In this case, 
capacitance masks some of the information. Similarly, for electromagnetic radiation, 
one has to consider wave interference. Nonetheless, switching activity 
simulation gives a good approximation of the energy being consumed over time, 
which is a good approximation for DPA and DEMA. 

4.2 Design-Time Analysis of Fault Tolerance 

A range of physical phenomena that can trigger faults may be modelled. We can 
then model a wide range of attack scenarios from single to multiple transistor 
failures. Given bounds on the control the attacker has, we can determine whether 
a fault can be injected without being detected. No matter what the source of 
the fault is, the end result is to somehow modify the data being manipulated. 
The aim of the game is to detect any attempt to cause bit flips and this is being 
investigated right now: had we identified, at simulation level, that no alarm signal 
was propagated when the ALU or the registers were tampered with, we would 
have redesigned those weak parts of the circuit. Systematic testing of faults can 
then be undertaken at a small module level through exhaustive simulation. Such 
simulations can take into account a wide range of conditions (e.g. single and 
multiple transistor failure induced by optical probing) in much the same way 
that traditional fault simulation is undertaken. Where alarms are generated at 
the small module’s level, it is then possible to reason about the propagation of 
alarm signals at a more abstract level for larger systems. 

Fig. 8. Power Simulation. Secure XAP executing XOR (plot title: DRXAP Current Comparison over XOR, H#11 xor H#22 vs. H#33 xor H#55) 

Fig. 9. Power Simulation. Secure XAP executing XOR, low pass filter applied 

Fig. 10. Power Simulation. Synchronous XAP executing XOR, low pass filtered (plot title: Low-pass Filtered Current Comparison over XOR, H#11 xor H#22 vs. H#33 xor H#55; x-axis: Time (ns)) 
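
The exhaustive module-level check described in this section can be sketched for a single gate. The example enumerates every combination of dual-rail input states and reports the cases where an alarm on an input fails to reach the output. The two AND-gate models are simplified illustrations written for this example and are not the gates used in the Springbank ALU.

```python
from itertools import product

# Dual-rail states as (A1, A0) pairs.
CLEAR, LOGIC0, LOGIC1, ALARM = (0, 0), (0, 1), (1, 0), (1, 1)
STATES = (CLEAR, LOGIC0, LOGIC1, ALARM)


def naive_and(a, b):
    """Simple dual-rail AND: the '1' rail is the AND of the '1' rails,
    the '0' rail is the OR of the '0' rails. No explicit alarm handling."""
    return (a[0] & b[0], a[1] | b[1])


def hardened_and(a, b):
    """Same function, but an alarm on any input is always propagated."""
    if ALARM in (a, b):
        return ALARM
    return naive_and(a, b)


def undetected_faults(gate):
    """Exhaustively list input combinations where an input alarm is lost."""
    escapes = []
    for a, b in product(STATES, repeat=2):
        if ALARM in (a, b) and gate(a, b) != ALARM:
            escapes.append((a, b))
    return escapes


if __name__ == "__main__":
    print("naive gate escapes:   ", undetected_faults(naive_and))
    print("hardened gate escapes:", undetected_faults(hardened_and))
```

For a gate this small the search space is only 16 input combinations. The same enumeration applied module by module is what would have flagged, before manufacture, gates that swallow the error state.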

5 Conclusion 

This paper has presented the first ever security evaluation of an asynchronous 
smart-card system. The secure asynchronous processor (SC-XAP) has shown 
interesting tamper-resistance properties. Nonetheless, we have identified weaknesses 
in our first attempt together with possible refinements to overcome these 
issues. 

Asynchronous circuits could become a trustworthy platform for secure com- 
puting. Circuit area is inevitably going to be larger than a simple synchronous 
design, but this has to be balanced against large memory (and thus chip area) 
savings that are possible if fewer software countermeasures are required. The lack 
of ECAD tool support for asynchronous circuit design is another issue, though 
we were able to make use of commercial place & route tools, standard cell li- 
braries, etc, and we were able to complete the design with just a small research 
team. 

Finally, we mention the concept of design-time security analysis. These tech- 
niques are centered around the simulation of a wide range of measurements and 
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fault possibilities for a wide range (preferably exhaustive) set of input data. We 
have demonstrated that power attacks and optical probing can be simulated. 
However, our longer term goal is to be able to make more general statements 
about the level of security attained which go far beyond current known attacks. 
With such an approach, we believe that security by design may become a far 
more powerful technique for processor designers.

Abstract. There are many applications for true, unpredictable random numbers.
For example, the strength of numerous cryptographic operations is often dependent
on a source of truly random numbers. Sources of random information are
available in nature but are often hard to access in integrated circuits. In some 
specialized applications, analog noise sources are used in digital circuits at great 
cost in silicon area and power consumption. These analog circuits are often in- 
fluenced by periodic signal sources that are in close proximity to the random 
number generator. We present a random number generator composed entirely of
digital circuits, which utilizes electronic noise. Unlike earlier work [11], only 
standard digital gates without regard to precise layout were used. 



1 Introduction 

True random number generators are desirable in many applications, ranging from
statistical system analysis to information security protocols and algorithms. Currently 
available true random number generators utilize circuitry that often consumes signifi- 
cant resources on integrated circuits and/or require incompatible analog and digital 
elements. Other, more primitive generators do not provide a convenient interface to 
electronic devices. In this paper we describe the design of a new type of true random 
number generator that is based solely on digital components (i.e. it is inexpensive to 
build), consumes little power, provides high throughput, and passes the DIEHARD 
[15] suite of tests for randomness. It is envisioned that this design of a random number
generator will provide an inexpensive alternative to generators that are currently
embedded in many systems such as microprocessors and smart cards.

This paper introduces the concepts behind a simple true random number generator. 
The focus of this paper is on digital circuits that exhibit metastability and those that 
function as unstable oscillators. Others have used metastability as a source of randomness
and have found some success, but only after trimming devices with a laser to
achieve a precisely balanced circuit (see, e.g., [10], [11], [12]).

We designed and implemented a wide array of true random number generators of
this type. The prototype chip consists of nine distinct designs. The prototype chip
was mounted on an acquisition breadboard for testing. The extracted results were 
analyzed to determine which designs yield the best results. The results show that even 
without de-biasing the resulting sequences, some of the designs provide random da- 
tasets that pass the DIEHARD suite of tests. This paper details the methodology for 
design, implementation and testing of a true random number generator based on digital 
artifacts. We conclude this paper by providing results, and outlining the necessary 
steps to create practical versions of these promising designs. 



2 Description of Digital Artifacts 

The design described in this paper yields random results by means of digital circuit 
artifacts [13], [14]. The design utilizes a pair of oscillators that are permitted to free- 
run. At some point, the free-running oscillators are coerced to matching states via a bi- 
stable device. While we believe that the circuit utilizes the metastability artifact, this 
conjecture has not been proven. However, we have experimental results from a similar 
circuit composed of discrete components, which does show metastability. A discussion 
of the two possible causes of randomness, metastability and oscillator drift and jitter, 
is presented below. 



2.1 Metastability 

Digital circuits, by their nature, are designed to be predictable. If the same logic levels 
in the same order are presented to a digital circuit, the same result should always occur.
However, if the rules governing the inputs of digital circuits are violated, then the
results can become unpredictable. In particular, the metastability phenomenon may
occur in a digital flip-flop or in a latch. A flip-flop, which is a memory element that 
can store one bit, is one type of bi-stable device. At the heart of every flip-flop is a 
pair of logic gates that are fed back to each other. This electrical feedback is what 
preserves the logical bit stored in the circuit. If the setup and hold conditions of the 
flip-flop are violated [1], then the pair of gates will behave unpredictably or even
oscillate about some intermediate voltage [2]. During the oscillatory or metastable state,
the output is neither a logical zero nor a logical one. After some time the oscillations
will die out and the flip-flop will settle into a logical state of a zero or a one, as shown
in figure 1.

There are three uncertainties at work here. The first uncertainty is whether the circuit will
behave normally, i.e. attain a logical state after the usual delay for the flip-flop, or
enter the metastable state [3]. The second uncertainty is the state that the
flip-flop will settle into. The third uncertainty is the length of time that the circuit will
remain metastable [1]. Some of these uncertainties have been measured [4] and modeled
[5] for various kinds of circuitry and environmental conditions. It has also been
shown that this phenomenon cannot be avoided in any flip-flop [4], [5]. Thus, the
metastability phenomenon provides a source of randomness that can be used to construct
a true random number generator without the need for specialized analog circuits.

Fig. 1. Recovery from metastable state displayed on a TDS7254 Tektronix oscilloscope using
the P7240 active probe. The signal resolving from a metastable state was captured using infinite 
persistence in the Fast Acquisition mode [y-axis: 1 volt/division with 8 divisions shown; x-axis: 
2ns/division with 10 divisions shown]. Lower intensity (lighter traces) shows more frequent 
signal traces. Note the single trace where the signal oscillates toward zero but eventually stabi- 
lizes at a one 



2.2 Oscillator Drift and Jitter 

A clock period is never a precise constant, even in highly regulated clock oscillators, 
such as those that utilize a crystal. Careful observation shows that the oscillating signal 
has slight changes of phase. Such perturbations of oscillator period are called jitter. 
Some components of jitter are random [6]. Precise measurement of jitter is more of an
art than a science [7].

The design uses two free-running (without a crystal or similar reference) oscillators,
which are allowed to drift away from each other. Such oscillators are known to
exhibit a great deal of variability even over short periods of time. Thus, the different 
internal noise of the two similar oscillators (causing phase jitter, accumulated as phase 
drift) can be utilized as a source of randomness. After some time the instantaneous 
voltage is latched by a digital bi-stable circuit, capturing the random state. 
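
The accumulation of jitter into phase drift can be illustrated with a toy simulation. The sketch below is a simplified behavioral model, not a description of the circuit: it assumes Gaussian jitter on each half-period, hypothetical period and jitter values, and it takes the latched bit to be whether the two oscillator levels disagree at the sampling instant. The function name and all parameters are assumptions made for illustration.

```python
import random


def sample_latched_bits(n_bits, period_a=1.00, period_b=1.07,
                        jitter=0.02, run_time=200.0, seed=1):
    """Toy model of two free-running oscillators sampled after run_time.

    Each oscillator toggles after every half-period; every half-period is
    perturbed by independent Gaussian jitter (standard deviation given as a
    fraction of the nominal period). All values are hypothetical.
    """
    rng = random.Random(seed)
    bits = []
    for _ in range(n_bits):
        levels = []
        for period in (period_a, period_b):
            t, level = 0.0, 0
            while True:
                # Jittered half-period, clamped so time always advances.
                half = max(1e-6, rng.gauss(period / 2, jitter * period))
                if t + half > run_time:
                    break
                t += half
                level ^= 1
            levels.append(level)
        # Assumption: the latched bit records whether the levels disagree.
        bits.append(levels[0] ^ levels[1])
    return bits


if __name__ == "__main__":
    bits = sample_latched_bits(1000)
    print("fraction of ones:", sum(bits) / len(bits))
```

In this model the timing error grows with the square root of the number of cycles, so a longer free-running interval leaves the sampled level less predictable.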



3 Integrated Circuit Design 



In order to validate these concepts, an integrated circuit containing nine distinct types (or
styles) of random number generators was constructed. Each design utilized a bi-stable device,
which in most cases contained a pair of gates to form the memory element. Given the obvious
dependence on circuit delays, each style of random generator was replicated in 15 to 31 different
varieties for a total of 247 distinct random number generators. The varieties used different gate
sizes, and thus different circuit delays, in pairs of matched or sometimes unmatched gates in an
effort to explore the entire problem space for each of the nine styles. All of the gates were drawn
from a standard 0.18 micron CMOS library and laid out automatically. No effort was made to
minimize or match wiring delays, although given the small size of the circuits it is likely that
some varieties are very similar.

A multiplexer was utilized to allow for the selection of a particular variety for that 
style of generator. Every variety from a particular style was also connected to a net- 
work of XOR gates thus combining all of the varieties of a style into a single output. 
This XOR output became one of the inputs to the style multiplexer as well. Any single 
variety or the XOR of the varieties in the style could be selected via a control word as 
shown in figure 2. 

Fig. 2. Zone circuit; Selection of various RNG designs was designated through multiple levels of
multiplexers

Each of the styles was then fed to a zone multiplexer. As with the varieties, all of
the styles were then XOR’ed together to create another input for the zone multiplexer.
A control word was configured to choose any particular style or the XOR combination
of all nine styles (see figure 2). Ultimately 288 variations could be individually
selected:

• A specific variety of a specific style of the random number generator 

• The XOR value of all varieties of a specific style 

• The XOR value of specific varieties combining all styles 

• The XOR value of all varieties of all styles. 
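
The text does not itemize how the 288 selections arise. The sketch below shows one accounting that reproduces the figure from numbers given in the text. The assumption that the third category contributes one selection per variety index (up to 31) is an illustrative guess and is not confirmed by the source.

```python
# Figures taken from the text.
TOTAL_VARIETIES = 247          # distinct generators across all nine styles
STYLES = 9                     # T1 to T9
MAX_VARIETIES_PER_STYLE = 31   # upper end of the "15 to 31" range


def count_selections():
    """Count selectable outputs under the assumed accounting."""
    single_variety = TOTAL_VARIETIES              # one variety of one style
    xor_within_style = STYLES                     # XOR of all varieties of a style
    xor_across_styles = MAX_VARIETIES_PER_STYLE   # assumption: one per variety index
    xor_of_everything = 1                         # XOR of all varieties of all styles
    return (single_variety + xor_within_style
            + xor_across_styles + xor_of_everything)


if __name__ == "__main__":
    print(count_selections())
```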

By including a wide variety of designs, numerous combinations of digital circuit 
artifacts were tested to determine if a real world random number generator could be 
created via a single circuit design or a combination of designs. 

Once a bi-stable device becomes metastable, its non-logic voltages can propagate
throughout a circuit and cause other flip-flops in the circuit to become metastable as
well. While it is desirable that the random number generator has an unpredictable
output, the same cannot be said for the typical circuit that requires the random bits.
Figure 2 includes a synchronizer circuit that effectively isolates the random number
generator from the rest of the circuit by capturing the bits in a series of three flip-flops.
Typical flip-flops will not enter a metastable state easily. However, even if the first of 
these flip-flops were to enter the metastable state, the clock period is sufficiently long 
so that it is likely that the flip-flop will have left the metastable state and resolved to a 
zero or one prior to the needed setup time for the second flip-flop. Similarly, the sec- 
ond flip-flop protects the third one. This method is a well-known technique [2] for 
reducing chances of failure due to metastable behavior. The expected time between
failures caused by metastable behavior propagating through the synchronization circuit
can be measured in tens or hundreds of years.
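
Failure intervals of this kind are usually estimated with the standard synchronizer model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time allowed for resolution, tau is the resolution time constant of the flip-flop, T_w is its metastability window, and f_clk and f_data are the clock and data transition rates. The sketch below evaluates that model. The function name and every numeric value are hypothetical placeholders, not measurements from the test chip.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def synchronizer_mtbf_seconds(resolve_time, tau, t_window, f_clock, f_data):
    """Mean time between synchronizer failures under the exponential model.

    resolve_time: time available for the flip-flop to resolve (s)
    tau:          resolution time constant of the flip-flop (s)
    t_window:     metastability window of the flip-flop (s)
    f_clock:      sampling clock frequency (Hz)
    f_data:       rate of asynchronous data transitions (Hz)
    """
    return math.exp(resolve_time / tau) / (t_window * f_clock * f_data)


if __name__ == "__main__":
    # Hypothetical device parameters, chosen only to exercise the formula.
    tau, t_window = 100e-12, 100e-12
    f_clock = f_data = 10e6
    for resolve_time in (2e-9, 4e-9, 6e-9):
        mtbf = synchronizer_mtbf_seconds(resolve_time, tau, t_window,
                                         f_clock, f_data)
        print(f"t_r = {resolve_time:.1e} s -> MTBF = "
              f"{mtbf / SECONDS_PER_YEAR:.3e} years")
```

Because the resolution time appears in the exponent, each additional flip-flop stage that grants another clock period of settling time multiplies the estimate by a large factor.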

In the interest of creating the greatest possible set of variations, the entire zone cir- 
cuit that is shown in figure 2 was replicated eight times. By purposely replicating the 
design numerous times, further variations in the circuit layout may have been intro- 
duced. The outputs of all zones were directed off chip for analysis. 



4 Circuit Description of a Random Number Generator 

Nine different styles or types of the random number generator were implemented in the test
chip. Results shown in Appendix A, Table 1 show that six of the styles, designated T1-T6,
failed completely by giving only a single binary value of one or zero. These six styles
attempted, in different ways, to cause a library flip-flop to become metastable by violating
the setup and hold requirements. The failure of these styles to generate truly random
sequences can be partially attributed to the fact that modern flip-flops are designed to
suppress the metastable artifact. Of the remaining styles, one (T9) gave results that failed
to pass all of the DIEHARD Tests. Two styles (T7 and T8) produced results that were random
when all of the varieties were XOR’ed together and sometimes from a single random number
generator circuit. Since the T7 and T8 designs were quite similar, the focus of the remaining
testing effort was on style T7. Figure 3 shows the basic principle behind the design of T7.
Later research may explore the usefulness of designs T8 and T9 and discover why T1-T6 failed
to produce random results (see appendix A).

Fig. 3. The T7 concept



4.1 Theory of Operation 

The basic concept behind style T7 is relatively simple. The design consists of two 
inverters and four switches, numbered 1 through 4, as shown in figure 4. 




Fig. 4. Operating phases of T7 

The switches can be implemented as transmission gates or as multiplexers. The actual
design was implemented as multiplexers as shown in figure 5. In this case the
multiplexers served as the delay elements as well as the switching elements.

Returning to the switch based design in figure 4a, if switches 1 and 4 are closed 
while switches 2 and 3 are open a configuration is created where the inverters form 
two independent, free-running ring oscillators, caused by the delay in the negative 
feedback loop. The inverters can be supplemented by delay elements implemented as 
a series of buffer gates or an even number of inverters. Ideally each of the oscillators 
should be sufficiently different so that if the circuit encounters a strong external signal 
there is little chance that both oscillators would synchronize to it.

If switches 1 and 4 are opened while switches 2 and 3 are closed, the connected
inverters form a bi-stable memory device as shown in figure 4b. Because of positive
feedback the outputs of the inverters eventually resolve, by clipping, to a consistent 
logic state. This final logic state creates one random bit. The randomness of the bit is 
derived from the conditions created when the two free-running oscillators are stopped 
(the oscillator feedback loops are opened and the two inverters get cross-connected). 
At that point the relative and absolute values of the instantaneous output voltages and 
the internal noise determine the eventual logic state the circuit will settle to, some- 
times even via the artifact of metastability. Thus, the randomness of the circuit ex- 
ploits two different mechanisms. 

4.2 Drift and Jitter 

The two oscillators drift apart in different ways because, when the oscillators are
switched on, they do not start immediately: they "hesitate" for a short period of time
and then the voltage goes either up or down, creating an uncertain starting point. The loop
gain of the oscillators should be small, otherwise they will start instantly. As the two
oscillators continue to oscillate, random circuit noise affects the unregulated oscillators
so that their clock periods are inconsistent from cycle to cycle. The combined effects
ensure that the two oscillators will find themselves in different states each time they 
are stopped. 

4.3 Metastability 

When the oscillators are stopped, the gates are cross-connected forming a bi-stable
memory element. At this point one or both output voltages may be between logic
levels. The bi-stable device, which is formed by the inverters, must settle to a logic
state over time. However, sometimes the bi-stable device will initially find itself in a
conflicted state as the two oscillators do not agree or even arrive at a consistent state
of disagreement. The bi-stable device will therefore oscillate for a period of time until
a final, stable state can be achieved. Thus the final state of the metastable device
produces a random bit.

Fig. 5. Actual T7 design based on bi-stable memory element. The random sequence generator
is a pair of cross-connected inverters where the multiplexers are used to switch between
oscillatory and bi-stable phases

4.4 Circuit Details

There were several varieties of the design laid out for the test chip. The types of
inverters were varied to achieve different delays. One option not explored was to
replace a single inverter with a chain of inverters for larger delays. Short (and different)
delays are preferred because the ring oscillators will produce smaller amplitude,
sinusoid-like signals, which should provoke metastability more often.

Very short delays might prevent oscillation if the gain of the circuit is too small at 
the fundamental frequency of the feedback loop. However, the circuit should work 
even in this case. The large negative feedback forces the output of the inverters to an 
intermediate voltage, close to halfway between logic levels. Flipping the switches to 
activate the bi-stable configuration at an intermediate voltage often forces metastability
[10], [11]. The final state achieved under these conditions is shown to be unpredictable
if the initial conditions of the bi-stable circuit are at intermediate voltages.

Fig. 6. Random bit acquisition; The ‘Select’ signal is used to drive the multiplexers that
choose between acquisition and oscillation phase of the random number generator. 



In the actual design a synchronous circuit was used to collect random numbers. 
Figure 6 shows how a divided version of the clock was used to switch the multiplexers 
via the select signal. When the select signal is high the circuit is in the oscillation state 
and each inverter operates independently. At that point the output is not in any logic 
state as is indicated by the diagonal crosshatches. Sufficient time is allowed for the
oscillators to diverge in the oscillation state. When the select line goes low the circuit 
is in the bi-stable configuration and resolves to a single value, either a 1 or 0, via me- 
tastable oscillations. The resolution time can be short or long but sufficient time must 
be allowed in order for the value to be resolved on the vast majority of occasions. 
Should the random bit be unresolved, the synchronization circuit described in section 3
will resolve it. The rising edge of the select signal can be used to acquire the random
bit as shown in figure 6.



5 Improving Randomness 

Variations within the manufacturing tolerances and unpredictable environmental 
changes (temperature, supply voltage etc.) will alter the behavior of the randomness 
circuit. The circuit may fail in an obvious way such as producing all 0 or all 1 output 
or possibly a heavily biased sequence. The simplest solution to this problem is to lay 
out many slightly different versions of the circuit on the chip. Under different
environmental conditions some of the versions will work randomly while others will
produce a biased output. If the differences in the circuits are small and there are a large
number of varieties of the randomness circuit, with high probability at least one vari- 
ety will always produce random results. 

The randomness of all of the different varieties is collected by simply XOR’ing all 
of the outputs of the random circuits on the chip [8]. If there is at least one truly ran- 
dom output sequence the final result will still be random, as shown by Matsui [9]. 
Since each random source is small, only a few hundred standard logic gates will pro- 
vide very high quality random numbers in real world conditions. 



6 Data Gathering and Analysis 

For each of the random number generators tested a data sequence of 80 megabits was 
collected. Each sequence was submitted to the DIEHARD Tests for evaluation. 

The DIEHARD Tests is a collection of 16 individual tests that altogether produce
215 results (called Pvalues). Each Pvalue is obtained by applying a function
[Pvalue = F_i(X) (i = 1, ..., 215)], where the function F_i seeks to establish a distribution
function of the sample random variable X as uniform between 0 and 1. In addition all
of the functions F_i are an asymptotic approximation, for which the fit will be worst in
the tails. Thus only rarely will one find Pvalues near 0 or 1, such as 0.0012 or 0.9983, 
if the sequence is random. When a sequence is decidedly non-random numerous Pval- 
ues of 0 or 1 to six or more decimal places (i.e. 1.000000) will be present. Thus for a 
random sequence, only a small number of Pvalues should have a value near zero or 
one, although the presence of occasional Pvalues of 1.000000 or 0.000000 is insuffi- 
cient to suggest that a sequence is non-random. 

Accordingly, we established a scoring system for the DIEHARD results, where se- 
quences were declared random provided that: 

1. There was no more than a single “hard failure” (Pvalue = 1.000000 or
0.000000) for the entire sequence of Pvalues. Two failures were permitted
if, and only if, both failures occurred in different varieties of a single
DIEHARD test.

2. There were fewer than 5 “near” failure values (where a “near” failure is a
Pvalue < 0.000099 or Pvalue > 0.999900).

If a sequence exhibited a small number of “hard failures” or more than 6 “near” 
values the results were considered as “indeterminate” and a retest was performed with 
a new dataset. If the retested data produced acceptable results, the generator was con- 
sidered acceptable. Just to be certain, we also looked for “almost” values (Pvalue < 
0.099999 or > 0.900000) and clusters of similar values. In general failed sequences 
had many “hard failures” to the point where few “near” or “almost” values exist in the 
entire sequence. Passed sequences almost never had a “hard failure” and very few 
“near” values. 
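
The two acceptance criteria can be expressed as a short check. The sketch below is an illustrative reading of the rules, not the authors' tooling: the function names and the (test name, Pvalue) input format are hypothetical, and it assumes that a "hard failure" is any Pvalue that prints as 0.000000 or 1.000000 and that hard failures are not also counted as "near" failures. The "indeterminate" retest category is left out because the text does not give it an exact boundary.

```python
def is_hard(p):
    """A "hard failure": the Pvalue prints as 0.000000 or 1.000000."""
    return round(p, 6) in (0.0, 1.0)


def is_near(p):
    """A "near" failure; hard failures are excluded (an assumption)."""
    return not is_hard(p) and (p < 0.000099 or p > 0.999900)


def declared_random(results):
    """Apply criteria 1 and 2 to one sequence.

    results: iterable of (diehard_test_name, pvalue) pairs, one pair per
    Pvalue reported for the sequence.
    """
    results = list(results)
    hard_tests = [name for name, p in results if is_hard(p)]
    near_count = sum(1 for _, p in results if is_near(p))
    # Criterion 1: at most one hard failure, or exactly two within one test.
    hard_ok = (len(hard_tests) <= 1
               or (len(hard_tests) == 2 and hard_tests[0] == hard_tests[1]))
    # Criterion 2: fewer than 5 near failures.
    return hard_ok and near_count < 5


if __name__ == "__main__":
    sample = [("test_a", 0.41), ("test_a", 1.0), ("test_b", 0.00005)]
    print(declared_random(sample))
```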

Ten different chips were received from the CMOS manufacturing facility. Four 
distinct tests were performed to confirm the results. The first chip was tested twice for 
each XOR result at each voltage. Additionally, a second chip was tested using another 
prototype board and the results were compared and verified. While some DIEHARD 
results produced “hard failures”, there were never more than two of such failures in 
any given sequence. Occasionally, two “hard failures” occurred in the same test of the 
DIEHARD suite of tests. No repetition of “hard failures”, or “near failures” was found 
among the four test runs. The results obtained from the four distinct test runs show 
that the same design tested in identical configurations produced different and random 
results. 



7 Results and Interpretation 



As explained previously, six of the nine proposed designs failed to produce any non- 
trivial bits (the output remained at constant 1 or 0 value). One design (T9) produced 
results that failed to pass DIEHARD Tests. Two designs (T7 and T8) produced results 
that were random when all of the varieties were XOR’ed together. In addition, T7 
produced random results from a single generator circuit at the nominal operating volt- 
age of 1.8 volts (see Appendix A, Table 1). Since designs T7 and T8 were closely 
related, testing and analysis focused only on the T7 design. 

In order to simulate real world conditions where gate delays can vary due to volt- 
age, temperature and processing differences, the circuit was tested at various voltages 
as is shown in Appendix A, Table 2. In all cases the XOR of the 15 varieties of design 
T7 produced random sequences. Intuitively this seems to indicate that as some varie- 
ties became more biased, other varieties became less biased at different voltage levels. 
However all varieties produce independent, random results. The XOR appears to rep- 
resent a collection of the entropy of all varieties, which can be explained as follows. 

We define the bias b of a random binary variable X as

\[ b = \left| \mathrm{Prob}(X = 1) - \tfrac{1}{2} \right| = \left| \mathrm{Prob}(X = 0) - \tfrac{1}{2} \right| \]

The T7 design uses the XOR of 15 binary sequences, each of which is believed to
be random, as the output. According to the “Piling-up lemma” [9], if these input
sequences are independent, the bias of the output sequence is

\[ b = 2^{n-1} \prod_{i=1}^{n} b_i \]

Here $n$ is the number of input sequences and $b_i$ ($1 \le i \le n$) is the bias of each of the
input sequences. It can be easily shown that $b \le b_i$ ($1 \le i \le n$), and the equality holds if
and only if $b_i = 0$ or $b_j = 1/2$ for all $j \ne i$. In general, the XOR will greatly reduce the
collective bias of the independent input sequences. For example, assume $n = 15$ and
$b_i = 1/4$ for all $i$; then $b = (1/2)^{16} \approx 0.000015$, which is negligible. As a practical matter the
experiments show that even when none of the individual sequences passes the 
DIEHARD test, the XOR of the results is still measured as random. Since the cost of 
the circuit is quite small, a further reduction in bias can be achieved via XOR’ing two 
of these circuits together. 
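
The bias reduction can be checked numerically. The sketch below assumes the piling-up lemma in its usual form, b = 2^(n-1) * b_1 * ... * b_n for independent inputs, and compares it against an exact enumeration for a small case. The function names are hypothetical and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from itertools import product
from math import prod


def xor_bias(biases):
    """Bias of the XOR of independent bits, by the piling-up lemma."""
    n = len(biases)
    return 2 ** (n - 1) * prod(biases)


def xor_bias_by_enumeration(p_ones):
    """Exact bias |P(xor = 1) - 1/2| for independent bits.

    p_ones[i] is the probability that bit i equals 1.
    """
    p_xor_one = Fraction(0)
    for bits in product((0, 1), repeat=len(p_ones)):
        if sum(bits) % 2 == 1:
            p = Fraction(1)
            for bit, p1 in zip(bits, p_ones):
                p *= p1 if bit else 1 - p1
            p_xor_one += p
    return abs(p_xor_one - Fraction(1, 2))


if __name__ == "__main__":
    # The example from the text: 15 inputs, each with bias 1/4.
    b = xor_bias([Fraction(1, 4)] * 15)
    print(b, float(b))

    # Small cross-check of the lemma against direct enumeration.
    p_ones = [Fraction(3, 4), Fraction(2, 3), Fraction(3, 5)]
    biases = [abs(p - Fraction(1, 2)) for p in p_ones]
    print(xor_bias(biases) == xor_bias_by_enumeration(p_ones))
```

The lemma holds only for independent inputs; correlated varieties would reduce the bias less than this calculation suggests.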

Further off-line processing of longer sequences from circuits that did not pass the 
DIEHARD tests provides an indication of the randomness of those circuits. Specifi- 
cally, when a simple Von Neumann corrector was applied to long sequences that 
failed, the resulting shorter sequences passed the DIEHARD suite of tests. The Von 
Neumann corrector is used in many applications to remove bias [16], [17] at the ex- 
pense of lower throughput. This shows that aside from bias, the results gathered from 
different varieties of T7 will provide good sources of random bits when properly de- 
biased. This also seems to suggest that the effects of voltage variations on different 
varieties of T7 are most pronounced in the bias within the sequence. By XOR’ing 
different varieties of T7, the effects of bias are reduced to tolerable levels as shown in 
Appendix A, Table 3. In essence the XOR function has allowed us to trade off area, as 
more varieties of T7 are needed, for speed since the XOR’ed result does not need to be 
debiased and higher throughput is possible. 
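
For reference, the Von Neumann corrector can be sketched in a few lines. The version below uses the common convention of reading non-overlapping pairs, emitting the first bit of each unequal pair and discarding equal pairs. The choice of which bit to emit, the function name, and the biased test source are assumptions made for illustration; the text does not specify the variant that was used.

```python
import random


def von_neumann(bits):
    """Debias a bit sequence with the Von Neumann corrector.

    Non-overlapping pairs are examined: (0, 1) emits 0, (1, 0) emits 1,
    and equal pairs are discarded. A trailing unpaired bit is ignored.
    """
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical source: independent bits with P(1) = 0.8.
    raw = [1 if rng.random() < 0.8 else 0 for _ in range(200000)]
    fixed = von_neumann(raw)
    print("raw ones fraction:      ", sum(raw) / len(raw))
    print("corrected ones fraction:", sum(fixed) / len(fixed))
    print("output/input length:    ", len(fixed) / len(raw))
```

For independent input bits with P(1) = p, the corrector yields on average p(1 - p) output bits per input bit, at most one quarter, which is the throughput cost that the XOR approach avoids.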

The total gate count for a random number generator based on the T7 design is as 
follows. Two AND gates, one inverter and one OR gate are used to implement each 
multiplexer. Thus a single randomness circuit requires four AND gates, four inverters 
and two OR gates. Since the design incorporates 15 instances of T7 and 14 XOR gates 
to collect the bits, the total number of gates is 60 AND gates, 60 inverters, 30 OR 
gates and 14 XOR gates. The small number of gates required to realize this random 
number generator makes the design suitable for applications that mandate strict power 
and area constraints. 



8 Conclusions 

In this paper, we have demonstrated that a practical random number generator can be 
built using standard digital gates and standard layout tools. The generator is stable 
even over large changes in operating voltage. This is a strong indication that such a 
generator will have good characteristics in extreme temperature conditions, such as 
may be found when a Smartcard is used at an outdoor Automatic Teller Machine 
(ATM) as well as resistance to attack by variation of voltage or temperature (side 
channel attacks).

The generator produces random bits by exploiting analog circuit artifacts found in
common digital circuits. These artifacts are utilized by creating circuits that are noise 
sensitive, allowing naturally occurring semiconductor noise to determine the final 
output. Even though an individual circuit may not be perfectly unbiased, a relatively 
small number of similar circuits could be XOR’ed to produce a usable random result. 

Additionally, we used different varieties of gates (all of which are standard library 
components) to immunize the overall design against expected environmental and pro- 
cess variations. 



8.1 Recommendations for Future Work 

Future work will involve the testing of the circuit over temperature extremes. Like- 
wise, the circuit should be tested at greater switching frequencies to determine the 
maximum usable bit rate. Varying the duty cycle of the switch signal so that different 
“oscillation” and “resolution” times are applied to the circuit would also be interesting 
for all of the designs. Combining changes in temperature, voltage, and frequency may 
also yield interesting results. 

Further work using de-biased versions of each generator would help to establish, but
not prove, the independence of the different varieties. It is possible that even a simple
Von Neumann corrector applied to a single variety would produce acceptable results.
However, such a design may not have some of the voltage immunity a combined design seems
to have.

Appendix A: Tabulated DIEHARD Results



Table 1. DIEHARD results for T1-T9 at 1.8 volts and room temperature

Table 2. DIEHARD results for T7 at room temperature and varying supply voltage




Table 3. Von Neumann correction of variety 7 and 9 for T7 at 1.8 volts. Table shows the
DIEHARD test results for biased and debiased sequence, and the reduction in the size of the
debiased sequence after passing the biased sequence through the Von Neumann corrector

Abstract. A true random number generator (TRNG) usually consists
of two components: an “unpredictable” source with high entropy, and a 
randomness extractor — a function which, when applied to the source, 
produces a result that is statistically close to the uniform distribution. 
When the output of a TRNG is used for cryptographic needs, it is pru- 
dent to assume that an adversary may have some (limited) influence on 
the distribution of the high-entropy source. In this work: 

1. We define a mathematical model for the adversary’s influence on the 
source. 

2. We show a simple and efficient randomness extractor and prove that 
it works for all sources of sufficiently high entropy, even if individual
bits in the source are correlated. 

3. Security is guaranteed even if an adversary has (bounded) influence 
on the source. 

Our approach is based on a related notion of “randomness extraction” 
which emerged in complexity theory. We stress that the statistical ran- 
domness of our extractor’s output is proven, and is not based on any 
unproven assumptions, such as the security of cryptographic hash func- 
tions. 

A sample implementation of our extractor and additional details can be 
found at a dedicated web page [Web]. 



1 Introduction 

1.1 General Setting 

It is well known that randomness is essential for cryptography. Cryptographic 
schemes are usually designed under the assumption of availability of an endless 
stream of unbiased and uncorrelated random bits. However, it is not easy to 
obtain such a stream. If not done properly, this may turn out to be the Achilles 
heel of an otherwise secure system (e.g., Goldberg and Wagner’s attack on the 
Netscape SSL implementation [GW96]). 

In this work we focus on generating a stream of truly random bits. This is the 
problem of constructing a true random number generator (TRNG). The usual 
way to construct such a generator consists of two components:

1. The first component is a device that obtains some digital data that is unpredictable
in the sense that it has high entropy. This data might come from
various sources, such as hardware devices based on thermal noise or radioac- 
tive decay, a user’s keyboard typing pattern, or timing data from the hard 
disk or network. We stress that we only assume that this data has high en- 
tropy. In particular, we do not assume that it has some nice structure (such 
as independence between individual bits). We call the distribution that is 
the result of the first component the high-entropy source. 

2. The second component is a function, called here a randomness extractor, 
which is applied to the high-entropy source in order to obtain an output 
string that is shorter, but is random in the sense that it is distributed ac- 
cording to the uniform distribution (or a distribution that is statistically 
very close to the uniform distribution). 

Our focus is on the second component. The goal of this work is to construct 
a single extractor which can be used with all types of high-entropy sources, and 
that can be proven to work, even in a model that allows an adversary some 
control over the source. 

Running a TRNG in adversarial environments. The high-entropy source used
in a TRNG can usually be influenced by changes in the physical environment
of the device. These changes can include changes in the temperature, changes
in the voltage or frequency of the power supply, exposure to radiation, etc. In
addition to natural changes in the physical environment, if we are using the 
output of a TRNG for cryptographic purposes, it is prudent to assume that an 
adversary may be able to control at least some of these parameters. Of course, 
if the adversary can have enough control over the source to ensure that it has 
zero entropy then, regardless of the extractor function used, the TRNG will be 
completely insecure. However, a reasonable assumption is that the adversary has 
only partial control over the source in a way that he can influence the source’s 
output, but not remove its entropy completely. 



1.2 Our Results 

In this paper, we suggest a very general model which captures such adversarial 
changes in the environment and show how to design a randomness extractor that 
will be secure even under such attacks. 

In all previous designs we are aware of, either there is no mathematical treat- 
ment or the source of random noise is assumed to have a nice mathematical 
structure (such as independence between individual samples). As the nature of 
cryptanalytic attacks cannot be foreseen in advance, it is hard to be convinced
of the security of a TRNG based on a set of statistical tests that were performed
on a prototype in ideal conditions. We also remark that it may be dangerous
to assume that the source of randomness has a nice mathematical structure,
especially if the environment in which the TRNG operates may be altered by an
adversary.

^1 Actually, the correct measure to consider here is not the standard Shannon entropy,
but rather the measure of “Min-Entropy” (see Remark 1). In the remainder of the
paper we will use the word “entropy” loosely as a measure of the amount of
randomness in a probability distribution.

Our extractor is simple and efficient, and compares well with previous
designs. It is based on pairwise-independent hash functions [WC81].^2 Our approach
is inspired by a somewhat different notion of “randomness extractors” defined
in complexity theory (see surveys [NTS99,Sha02] and Section 1.3).

Our design works in two phases: 

Preprocessing: In this phase the manufacturer (or the user) chooses a string
$\pi$ which we call a public parameter. This string is then hardwired into the
implementation and need not be kept secret. The same string $\pi$ can be
distributed in all copies of the randomness extractor device, and will be
used whenever they are executed. (We discuss this in detail in Section 5.)

Runtime: In this phase the randomness extractor gets data from the
high-entropy source and its output is a function of this data and the public
parameter $\pi$.

The analysis guarantees that if $\pi$ is chosen appropriately in the preprocessing
phase and the high-entropy source has sufficient entropy then the output of the 
TRNG is essentially uniformly distributed even when the environment in which 
the TRNG operates is altered by an adversary. This guarantee holds as long as 
the adversary has limited influence on the high-entropy source. 

In particular, we make no assumption on the structure of the high-entropy 
distribution except for the necessary assumption that it contains sufficient en- 
tropy. Existing designs of high-entropy sources seem to achieve this goal. 

1.3 Previous Works 

Randomness extractors used in practice. As far as we are aware, all extractors
previously used in practice as a component in a TRNG fall under the following
two categories:

Designs assuming mathematical structure. These are extractors that 
work under the assumption that the physical source has some “nice” math- 
ematical structure. 

An example of such an extractor is the von Neumann extractor [vN51], used
in the design of the Intel TRNG [JK99]. On input a source $X_1, \ldots, X_n$, the
von Neumann extractor considers successive pairs of the form $X_{2i}, X_{2i+1}$; for
each pair, if $X_{2i} \neq X_{2i+1}$ then $X_{2i}$ is sent to the output, otherwise nothing
is sent. The von Neumann extractor works if one assumes that all bits
in the source are independent and identically distributed. That is, each
bit in the source will be equal to 1 with the same probability $p$, and this will
happen independently of the values of the other bits. However, it may fail if
different bits are correlated or have different biases.

^2 Some choices of the parameters require use of $\ell$-wise independent hash functions for
$\ell > 2$.
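
The von Neumann procedure described above can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical and the code is not taken from the Intel design or from [Web]. The second call shows the failure mode mentioned in the text, using a perfectly correlated (alternating) source whose output is constant.

```python
from typing import Iterable, List


def von_neumann_extract(bits: Iterable[int]) -> List[int]:
    """Apply the von Neumann procedure to a sequence of 0/1 values.

    Bits are consumed in non-overlapping pairs; an unequal pair emits its
    first bit, an equal pair emits nothing.
    """
    out = []
    it = iter(bits)
    for first in it:
        second = next(it, None)
        if second is None:       # odd leftover bit: discard it
            break
        if first != second:      # unequal pair: output the first bit
            out.append(first)
    return out


# Pairs (0,1), (1,1), (1,0), (0,0): only the unequal pairs produce output.
sample = von_neumann_extract([0, 1, 1, 1, 1, 0, 0, 0])

# A correlated source: strictly alternating bits. Every pair is (0,1),
# so the "extracted" stream is all zeros.
alternating = von_neumann_extract([0, 1] * 4)
```

The alternating input has one fair-looking bit frequency (half zeros, half ones) yet yields a completely predictable output, which is why the independence assumption matters.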

Other constructions that are sometimes used have every bit of the output be
the XOR of bits in the source that are “far from each other”. Such constructions
assume that these “far away” bits are independent. 

RFC 1750 [ErCS94] also suggests some heuristics such as applying a Fast 
Fourier Transform (FFT) or a compression function to the source. However, 
we are not aware of any analysis of the conditions on the source under which 
these heuristics will provide a uniform output.

Applying a cryptographic hash function. Another common approach 
(e.g., [ErCS94], [Zim95]) is to extract the randomness by applying a
cryptographic hash function (or a block cipher) to the high-entropy source.
The result is expected to be a true random (or at least pseudo-random)
output. As there is no mathematical guarantee of security, confidence that
such constructions work comes from the extensive cryptanalytic research
that has been done on these hash functions. However, this research has
mostly been concentrated on specific “pseudorandom” properties (e.g., 
collision-resistance) of these functions. It is not clear whether this research 
applies to the behavior of such hash functions on sources where the only 
guarantee is high entropy, especially when these sources may be influenced 
by an adversary that knows the exact hash function that is used. 

Randomness extractors in complexity theory. The problem of extracting 
randomness from high-entropy distributions is also considered in complex- 
ity theory (for surveys, see [NTS99,Sha02]). However, the model considered 
there allows the adversary to have full control over the source distribution. 
The sole restriction is that the source distribution has high entropy. One 
pays a heavy price for this generality: it is impossible to extract randomness
by a deterministic randomness extractor.^3 Consequently, this notion of
randomness extractors (defined in [NZ96]) allows the extractor to also use a few
additional truly random bits. The rationale is that the extractor will output
many more random bits than initially spent. While this concept proves to
be very useful in many areas of computer science, it does not provide a way
to generate truly random bits for cryptographic applications.^4

Nevertheless, our solution uses techniques from this area. For the reader
familiar with this area, we remark that our solution builds on observing
that a weaker notion of security (the one described in this paper) can be
guaranteed even when the few additional random bits are chosen once and
for all by the manufacturer and made public.

^3 Consider even the simpler task of extracting a single bit. Every candidate randomness
extractor $E : \{0,1\}^n \to \{0,1\}$ partitions $\{0,1\}^n$ into two sets $B_0, B_1$ where $B_i$ is
the set of all strings mapped to $i$ by $E$. Assume w.l.o.g. that $|B_0| \ge |B_1|$. Then the
adversary can choose the source distribution $X$ to be the uniform distribution over
$B_0$ and thus, $E(X)$ is always fixed as 0 and is not at all random. Note that if $E(\cdot)$
can be computed efficiently, then this distribution $X$ can also be sampled efficiently.

^4 Some weaker notions of randomness extractors were proposed. These notions
usually suggest considering restricted classes of random sources. See [TV02] and the
references there.
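
The impossibility argument for deterministic extractors (partition the inputs by output bit and let the adversary pick the larger part) can be made concrete for a small input length. The sketch below is only an illustration; the helper name `bad_source_for` and the choice of parity as the candidate extractor are assumptions made for the example.

```python
from itertools import product


def bad_source_for(extractor, n):
    """Return (b, support) for a one-bit deterministic extractor on n bits.

    `support` is the larger of the two preimage sets; the uniform
    distribution over it has min-entropy at least n - 1, yet the extractor
    always outputs the constant bit b on it.
    """
    preimages = {0: [], 1: []}
    for x in product((0, 1), repeat=n):
        preimages[extractor(x)].append(x)
    b = 0 if len(preimages[0]) >= len(preimages[1]) else 1
    return b, preimages[b]


def parity(x):
    """A candidate one-bit extractor: XOR of all input bits."""
    return sum(x) % 2


b, support = bad_source_for(parity, 4)
# Every string in `support` maps to the same bit b, although the uniform
# distribution over these 8 strings has 3 bits of min-entropy.
constant_output = all(parity(x) == b for x in support)
```

The same search works for any function passed as `extractor`, which is the point of the argument: no fixed deterministic function survives an adversary who knows it.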



1.4 Advantages and Disadvantages of Our Scheme 

The main advantage of our scheme is that it is proven to work for every high- 
entropy source, provided that the adversary has only limited control on the 
distribution of the source. By contrast, previous schemes are either known to fail 
for some very natural high-entropy sources (e.g., the von Neumann extractor),
or lack a relevant formal analysis (see above). 



Efficiency. It is natural to measure the performance of a randomness extractor 
in terms of the cost per output bit. This measure depends on the following 
factors: 

1. Cost: The speed and size of the hardware or software implementation of the
extractor. 

2. Entropy rate: The amount of entropy contained in the source. 

3. Entropy loss: The difference between the amount of entropy that the high- 
entropy source contains and the number of bits extracted. 

Our design allows tuning the running time and entropy loss as a function of 
the expected entropy rate and the desired resiliency against adversarial effects 
on the source. This tuning helps to achieve good overall performance in different 
scenarios. We discuss specific scenarios below. 

In general, our approach is quite simple and efficient and is suitable for a 
hardware implementation. Its cost is comparable to that of cryptographic hash 
functions, and it can provably achieve low entropy loss and extract more than 
half of the entropy present in the source (by comparison, the von Neumann 
extractor extracts at most half of the entropy).^5

Example: low entropy rate. For example, consider the case where the source 
is the typing patterns of a user. In this case the speed at which one can sample 
the high-entropy source is comparatively slow, and furthermore sampling the 
source may be expensive. It is thus crucial to minimize entropy loss and extract 
as much as possible from the entropy present in the source. Our design allows 
extracting 3/4 of the entropy in the source at a slight cost to the running time. 
In this case, the running time is less significant as the bottleneck is the sampling 
speed from the random source. 

Example: high entropy rate. Consider the case where the source is sampling 
of thermal noise. Now the running time is important and we can tune our design 
to work faster at the cost of higher entropy loss. 

The existence of a formal proof of security can be helpful when optimizing the
implementation. Our proof shows that any implementation of “universal hash
functions” (or “$\ell$-wise independent hash functions”) suffices for our
randomness extractor. Thus, a designer can choose the most efficient implementation
he finds and optimize it to suit his particular architecture. This is in contrast to
cryptographic hash functions, which do not have a proof of security and where
the effect of changes (e.g., removing a round) is unknown, and thus such
optimizations are not recommended in practice.

^5 The basic von Neumann extractor can be extended to extract more bits at some cost
to the algorithm’s efficiency [Per92].



A public parameter. One disadvantage of our scheme is the fact that it uses 
a public parameter.^6 The security of the scheme is proven under the assumption
that the parameter is chosen at random. This parameter needs to be chosen only 
once and the resulting scheme will be secure with extremely high probability. 

We stress that we do not assume that this parameter is kept secret. More- 
over, this parameter can be chosen once and for all by the manufacturer and 
hardwired into all copies of the device. We also do not assume that the distri- 
bution of the high-entropy source is completely independent from the choice of 
this parameter — our model allows this distribution to be partially controlled 
by a computationally-unbounded adversary that knows the public parameter. 

Note that a public parameter is necessary to obtain the security properties 
that we require. 



2 The Formal Model 

2.1 Preliminaries 

Min-Entropy. The min-entropy of the source $X$, denoted by min-Ent($X$), is the
maximal number $k$ such that for every $x \in X$, $\Pr[X = x] \le 2^{-k}$.

Remark 1. Min-entropy is a stricter notion than the standard (Shannon) en- 
tropy, in the sense that the min-entropy of X is always smaller than or equal to 
the Shannon entropy of X. 
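
To see how far apart the two measures can be, the following sketch (an illustration with hypothetical function names) computes both for a distribution that puts probability 1/2 on one element and spreads the rest evenly over 1024 others.

```python
import math


def min_entropy(probs):
    """min-Ent: -log2 of the largest probability in the distribution."""
    return -math.log2(max(probs))


def shannon_entropy(probs):
    """Standard Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# One heavy element with probability 1/2, and 1024 light elements
# sharing the remaining probability 1/2.
skewed = [0.5] + [0.5 / 1024] * 1024

h_min = min_entropy(skewed)          # 1 bit
h_shannon = shannon_entropy(skewed)  # 0.5 * 1 + 0.5 * 11 = 6 bits
```

Here the Shannon entropy is 6 bits, but a single sample equals the heavy element half of the time, so no extractor can be expected to produce much more than one nearly uniform bit from it.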

It is easy to see that it is impossible to extract $m$ bits from a distribution $X$
with min-Ent($X$) $< m - 1$. This is because such a distribution gives probability
at least $2^{-(m-1)}$ to some element $x$. It follows that for any candidate extractor
function $E : \{0,1\}^n \to \{0,1\}^m$ the element $y = E(x)$ has probability at least
$2^{-(m-1)}$, so $E(X)$ is far from being uniformly distributed.

We conclude that having min-entropy larger than $m$ is a necessary condition
for randomness extraction. In this paper we show that having min-entropy $k$
slightly larger than $m$ is a sufficient condition.

Statistical Distance. We use dist($X$, $Y$) to denote the statistical distance between
$X$ and $Y$, that is: $\frac{1}{2} \sum_a |\Pr[X = a] - \Pr[Y = a]|$. We say that $X$ is $\epsilon$-close to $Y$
if dist($X$, $Y$) $\le \epsilon$.
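
A direct transcription of this definition for finite distributions, given as a sketch with hypothetical names (distributions are represented as dictionaries from outcomes to probabilities):

```python
def statistical_distance(p, q):
    """Half the L1 distance between two finite distributions.

    p and q map outcomes to probabilities; missing outcomes have
    probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


def is_eps_close(p, q, eps):
    """True if p is eps-close to q in the sense defined above."""
    return statistical_distance(p, q) <= eps


uniform_2bit = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
biased = {"00": 0.40, "01": 0.20, "10": 0.20, "11": 0.20}

d = statistical_distance(uniform_2bit, biased)  # 0.5 * (0.15 + 3 * 0.05)
```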

^6 The description of many hash functions and block ciphers includes various
semi-arbitrary constants; arguably these can also be considered public parameters.

Table 1. List of parameters

n           The length (in bits) of a sample from the high-entropy source.
k           The min-entropy of the high-entropy source.
t           The adversary can alter the environment in at most $2^t$ different ways.
m           The length (in bits) of the output of the randomness extractor.
$\epsilon$  The statistical distance between the uniform distribution
            and the output of the randomness extractor.



Notation for probability distributions. We denote by $U_m$ the uniform distribution
on strings of length $m$. If $X$ is a distribution then by $x \leftarrow X$ we mean that $x$ is
chosen at random according to the distribution $X$. If $X$ is a set then we mean
that $x$ is chosen according to the uniform distribution on the set $X$. If $X$ is a
distribution and $f(\cdot)$ is a function, then we denote by $f(X)$ the random variable
that is the result of choosing $x \leftarrow X$ and computing $f(x)$.

2.2 The Parameters 

The parameters for our design are listed in Table 1. We think of samples from 
the source as coming in blocks of n bits. 

The goal is to design an extractor that, given an adversarially-chosen $n$-bit
source with $k$ bits of entropy, is resilient against as much adversary influence
as possible (i.e., maximize $t$), while extracting as many random bits as possible
(i.e., maximize $m$), with negligible statistical distance $\epsilon$. In Section 2.5 we give a
sample setting of the parameters.

2.3 Definition of Security 

Definition 1 (Extractor). An extractor is a function $E : \{0,1\}^n \times S \to
\{0,1\}^m$ for some set $S$.

Denote by $E^{\pi}(\cdot) = E(\cdot, \pi)$ the one-input function that is the result of fixing
the parameter $\pi$ to the extractor $E$. We would like the output $E(X, \pi) = E^{\pi}(X)$
to be (close to) uniformly distributed, where $X \in \{0,1\}^n$ is the output of the
high-entropy source and $\pi \in S$ is the public parameter.



Defining Security. We consider the following ideal setting: 

1. An adversary chooses $2^t$ distributions $\mathcal{D}_1, \ldots, \mathcal{D}_{2^t}$ over $\{0,1\}^n$, such that
min-Ent($\mathcal{D}_i$) $\ge k$ for all $i = 1, \ldots, 2^t$.

2. A public parameter $\pi$ is chosen at random and independently of the choices
of $\mathcal{D}_i$.

3. The adversary is given $\pi$, and selects $i \in \{1, \ldots, 2^t\}$.

4. The user computes $E^{\pi}(X)$, where $X$ is drawn from $\mathcal{D}_i$.

Definition 2 (t-resilient extractor). Given $n, k, m, \epsilon$ and $t$, an extractor $E$
is t-resilient if, in the above setting, with probability $1 - \epsilon$ over the choice of the
public parameter the statistical distance between $E^{\pi}(X)$ and $U_m$ is at most $\epsilon$.
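
The four steps of the ideal setting and the condition in Definition 2 can be checked exhaustively for toy parameters. The sketch below is an illustration only: all names are hypothetical, the inner-product function is a stand-in chosen for brevity rather than the extractor proposed in this paper, and exact rational arithmetic is used so that the comparison with $\epsilon$ is not affected by rounding.

```python
from fractions import Fraction


def output_distribution(extractor, pi, source):
    """Distribution of extractor(x, pi) when x is drawn from `source`.

    `source` maps each input (an int) to its probability.
    """
    out = {}
    for x, p in source.items():
        y = extractor(x, pi)
        out[y] = out.get(y, 0) + p
    return out


def distance_from_uniform(dist, m):
    """Statistical distance between `dist` and the uniform distribution on m bits."""
    u = Fraction(1, 2 ** m)
    return sum(abs(dist.get(y, 0) - u) for y in range(2 ** m)) / 2


def worst_case_distance(extractor, pi, sources, m):
    """Step 3: the adversary sees pi and selects the worst of its sources."""
    return max(distance_from_uniform(output_distribution(extractor, pi, d), m)
               for d in sources)


def fraction_of_bad_parameters(extractor, parameters, sources, m, eps):
    """Fraction of public parameters for which some source is not eps-close."""
    bad = sum(1 for pi in parameters
              if worst_case_distance(extractor, pi, sources, m) > eps)
    return Fraction(bad, len(parameters))


def inner_product(x, pi):
    """Toy one-bit function: parity of the bitwise AND of x and pi."""
    return bin(x & pi).count("1") % 2


half = Fraction(1, 2)
# Step 1: two sources over 2-bit strings, each with min-entropy 1.
sources = [{0b00: half, 0b01: half}, {0b00: half, 0b10: half}]
# Step 2 is modelled by enumerating every candidate parameter.
bad = fraction_of_bad_parameters(inner_product, [1, 2, 3], sources, 1,
                                 Fraction(1, 4))
```

In this toy case two of the three parameters can be defeated by the adversary, so the stand-in function is far from t-resilient; the point of the sketch is only to show what quantity Definition 2 bounds.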

Interpretation. The above ideal setting is intended to capture security in the 
following scenario. A manufacturer designs a noise generating device whose out- 
put is a random variable X. Ideally, we would like the adversary not to be 
able to influence the distribution of X at all. However, in a realistic setting 
the adversary has some control over the environment in which the device
operates (temperature, voltage, frequency, timing, etc.), and it is possible that
changes in this environment affect the distribution of $X$. We assume that the
adversary can control at most $t$ boolean properties of the environment, and can
thus create at most $2^t$ different environments. The user observes the value of $X$
which, conditioned on the choice of environment being $i$, is distributed as $\mathcal{D}_i$.
The definition of security guarantees that the output of the extractor is close to 
uniformly distributed. 

In fact, the security definition (which our construction fulfills) may be 
stronger than necessary in several senses: 

• We do not assume any computational bound on the adversary. 

• We do not assume that the user knows which environment i was chosen. 

• More fundamentally, we do not require that either the user or the manufacturer
know which properties of the environment are controllable by the
adversary. The only limitation is that the adversary can control at most $t$
(boolean) properties, and that the source entropy is at least $k$.

• We allow adversarial choice of the entire source distribution for each of the $2^t$
environment settings. Thus, even for “normal” environment settings, the source
may behave in the worst possible way subject to the above requirements. By 
contrast, in the real world the source distributions would be determined by 
the manufacturer, presumably in the most favorable way. 

• The behavior of the source may change arbitrarily for different environments
(i.e., the distributions $\mathcal{D}_i, \mathcal{D}_j$ for $i \neq j$ need not be related in any way). In
the real world, many properties of the source would persist for all but the 
most extreme environment settings. 



Remark 2. One can make a stricter security requirement by allowing the
adversary to choose $i$ not only as a single value, but also as a random variable with
an arbitrary distribution over $\{1, \ldots, 2^t\}$. The two definitions are equivalent.



Remark 3. Many applications require a long stream of output bits. In such cases, 
our extractor can simply be applied to successive blocks of inputs, always with 
the same fixed public parameter. The security is guaranteed as long as each input
block contains $k$ bits of conditional min-entropy (defined analogously to
conditional entropy), conditioned on all previous blocks. Of course, this still requires
that the conditional distribution of every input block is one of $\mathcal{D}_1, \ldots, \mathcal{D}_{2^t}$.

2.4 Our Result

Our main result is an efficient design for a t-resilient extractor for a wide range 
of the parameters. We have the following theorem: 

Theorem 1. For every $n$, $k$, $m$ and $\epsilon$ there is a t-resilient extractor with a public
parameter of length $2n$ such that^7

$$t = \frac{k - m}{2} - 2\log(1/\epsilon) - 1$$

We can increase $t$ at the cost of an increase in the running time and the
length of the public parameter $\pi$, to obtain:

Theorem 2. For every $n$, $k$, $m$ and $\epsilon$ and $\ell \ge 2$ there is a t-resilient extractor
with a public parameter of length $\ell n$ such that

$$t = \frac{\ell}{2}\left(k - m - 2\log(1/\epsilon) - \log \ell + 2\right) - m - 2 - \log(1/\epsilon)$$

We explain our construction in Section 3 and prove its security in Ap- 
pendix A. In Section 4 we present potential implementations. 

Note that in the above theorems, $t$ does not depend on $n$. In other words,
the resiliency $t$ depends only on the amount of entropy loss (i.e., $k - m$), the
statistical distance (i.e., $\epsilon$) that we are willing to accept and the parameter $\ell$.

2.5 Sample Settings of Parameters 

As the theorems involve many parameters we give concrete examples for natural 
choices. In the following examples we will consider extracting m = 256 bits which 
are $\epsilon = 2^{-35}$-close to uniform from a source containing $k = 512$ bits of entropy.
The choice of $n$ (the length of the source) should depend on the expected quality
of the random sample. For example, if the source is the typing pattern of a user
then its entropy rate is low and we may need to set $n \approx 2500$ in order to have 512
bits of entropy, while for dedicated noise-generation hardware we may assume 
that the entropy rate is very high and set n = 768. 

Using Theorem 1 we get t = 57 for k = 512. Using the less efficient design 
of Theorem 2 we can improve both entropy loss and security: choosing $\ell = 16$
we get $t = 667$ and can even reduce $k$ to $k = 448$. These numbers are just
an illustration and different tradeoffs between performance, entropy loss and 
guarantee of security can be made. 
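
The figures quoted above can be reproduced from the expressions for $t$ in Theorems 1 and 2. The sketch below is an illustration with hypothetical function names; it assumes $\log(1/\epsilon) = 35$ (that is, $\epsilon = 2^{-35}$), which is the value for which both quoted figures come out exactly, and it takes the figure $t = 667$ to refer to the reduced entropy $k = 448$.

```python
import math


def resilience_theorem1(k, m, log_inv_eps):
    """t = (k - m)/2 - 2 log(1/eps) - 1."""
    return (k - m) / 2 - 2 * log_inv_eps - 1


def resilience_theorem2(k, m, log_inv_eps, ell):
    """t = (ell/2)(k - m - 2 log(1/eps) - log ell + 2) - m - 2 - log(1/eps)."""
    inner = k - m - 2 * log_inv_eps - math.log2(ell) + 2
    return (ell / 2) * inner - m - 2 - log_inv_eps


t1 = resilience_theorem1(k=512, m=256, log_inv_eps=35)                  # 57
t2_reduced = resilience_theorem2(k=448, m=256, log_inv_eps=35, ell=16)  # 667
t2_full = resilience_theorem2(k=512, m=256, log_inv_eps=35, ell=16)     # 1179
```

Under these assumptions, keeping $k = 512$ with $\ell = 16$ would give an even larger resiliency than the quoted 667, at the same running-time cost.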

3 Our Design 

In this section, we present our construction which is based on the notion of
“$\ell$-wise independent hash functions”. We formally define this notion and
describe our construction in these terms in Section 3.1 and prove correctness in
Appendix A. We discuss implementation of this construction in Section 4.

^7 In fact, the constructions described in Section 4 have a shorter public parameter, of
length $n$ and $n + m - 1$.

3.1 Randomness Extractors from $\ell$-Wise Independence

We start by recalling the notion of $\ell$-wise independence.

Definition 3 ($\ell$-wise independence). A collection $Z_1, \ldots, Z_n$ of random
variables is called $\ell$-wise independent if for every $i_1, \ldots, i_\ell \in \{1, \ldots, n\}$ the
random variables $Z_{i_1}, \ldots, Z_{i_\ell}$ are independent.

A very useful tool is the notion of $\ell$-wise independent families of hash
functions. Intuitively, such functions have some properties of random functions even
though they’re much less random.

Definition 4 ($\ell$-wise independent families of hash functions). Given a
collection $H = \{h_s\}_{s \in S}$ of functions $h_s : \{0,1\}^n \to \{0,1\}^m$, we consider the
probability space of choosing $s \in_R S$. For every $x \in \{0,1\}^n$ we define the random
variable $R_x = h_s(x)$ (Note that $x$ is fixed and $s$ is chosen at random). We say
that $H$ is an $\ell$-wise independent family of hash functions if:

— For every $x$, $R_x$ is uniformly distributed in $\{0,1\}^m$.

— The random variables $\{R_x\}_{x \in \{0,1\}^n}$ are $\ell$-wise independent.

The usefulness of this definition stems from the fact that there are such
families which are relatively small (of size $2^{\ell n}$) and for which $h_s(x)$ can be
efficiently computed given $s$ and $x$.

In these terms, our construction is described as follows: randomly choose
$s \leftarrow S$ (this is the “public parameter”), and let the randomness extractor be
simply

$$E(x) = h_s(x)$$

In Appendix A we show that for appropriate parameters, this yields a t-resilient 
extractor. The following section describes concrete constructions. 



4 Implementation 

This section describes several extractor implementations based on known
constructions of pairwise-independent hash functions [CW79,WC81]. Recall that an
implementation of our randomness extractor has two phases:

— Preprocessing: Choosing the public parameter $\pi$.

— Runtime: Running $E^{\pi}(x)$ on a given string $x$.

As the first phase is done once and for all at a preprocessing phase, we
focus on optimizing the resources used by the second phase. Also, the known
implementations of $\ell$-wise independent hash functions (for large $\ell$) have higher
implementation cost than 2-wise independent hash functions; we thus consider
only implementations of 2-wise independent hash functions.

4.1 Linear Functions

Theorem 3. Let $GF(2^n)$ be the field with $2^n$ elements, and $S = \{(a,b) \mid a,b \in
GF(2^n)\}$. For $s = (a,b)$ and $m < n$ define: $h_s(x) = (a \cdot x + b)_{1,\ldots,m}$ (i.e., the
first $m$ bits of $a \cdot x + b$ where arithmetic operations are in $GF(2^n)$). Then the
family $\{h_s\}_{s \in S}$ is a 2-wise independent family of hash functions.

The field $GF(2^n)$ can be realized as the field of polynomials over $GF(2)$
modulo an irreducible polynomial of degree $n$. When the field elements are
represented as coefficient vectors, addition is the bitwise XOR operation while
multiplication requires a modular reduction operation that is easily implemented in
hardware for small $n$, but grows expensive for larger ones and is not well suited
for software implementations.

In Appendix A we show that for every random variable $X$ with min-Ent($X$) $\ge k$,
for an appropriate choice of $m$ we have that for most pairs $s = (a,b)$, $h_s(X)$
is close to uniform.

We have $s = (a,b)$, $h_s(x) = (ax)_{1,\ldots,m} + b_{1,\ldots,m}$. Define $g_a(x) = (ax)_{1,\ldots,m}$.
For every fixed pair $(a,b)$, $b_{1,\ldots,m}$ is constant. Thus, $h_s(X)$ is close to uniform
if and only if $g_a(X)$ is close to uniform. This yields the following extractor for
any $m < n$, where the relation to $t$ and $\epsilon$ follows from Theorem 1.

— Preprocessing: choose some irreducible polynomial of degree $n$. Choose a
string $a \in \{0,1\}^n$ at random and set $\pi = a$.

— Runtime: $E^{\pi}(x) = (a \cdot x)_{1,\ldots,m}$ using multiplication in $GF(2^n)$.

Remark 4. A family of $\ell$-wise independent hash functions can be constructed in
a similar manner by setting $s = (a_1, \ldots, a_\ell)$ and $h_s(x) = \sum_{1 \le i \le \ell} a_i x^{i-1}$.
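
The linear construction of this section can be sketched in Python for a toy field size. The code is an illustration only and is not the implementation at [Web]: the names are hypothetical, the field is $GF(2^8)$ with the Rijndael modulus rather than a realistic $n$, and "the first $m$ bits" is taken to mean the $m$ most significant bits, which is an assumption about the bit ordering.

```python
import secrets

N = 8            # toy field size: GF(2^8)
MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)


def gf_mul(a, b, n=N, modulus=MODULUS):
    """Multiply two field elements stored as n-bit integers.

    Shift-and-add multiplication of polynomials over GF(2), reducing
    modulo `modulus` whenever the degree reaches n.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= modulus
    return result


def hash_ab(x, a, b, m, n=N):
    """h_s(x) for s = (a, b): the m most significant bits of a*x + b."""
    return (gf_mul(a, x, n) ^ b) >> (n - m)


def choose_public_parameter(n=N):
    """Preprocessing: pick the multiplier a uniformly at random."""
    return secrets.randbits(n)


def extract(x, a, m, n=N):
    """Runtime: the m most significant bits of a*x."""
    return gf_mul(a, x, n) >> (n - m)
```

For a fixed pair of distinct inputs, enumerating all $2^{16}$ parameters $(a, b)$ shows every pair of $m$-bit outputs occurring equally often, which is the pairwise independence property of Theorem 3 for this toy field.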

4.2 Binary Toeplitz Matrices 

Theorem 4. Let $\mathbb{F}$ be a finite field and let $n', m'$ be integers, $m' < n'$. Let
$S = \mathbb{F}^{n'+m'-1}$. For $s \in S$ and $x \in \mathbb{F}^{n'}$ define $h_s(x) \in \mathbb{F}^{m'}$ by
$(h_s(x))_i = \sum_{j=0}^{n'-1} x_j s_{i+j}$. Then the family $\{h_s\}_{s \in S}$ is a 2-wise independent family
of hash functions.

The function $h_s(x)$ can be thought of as multiplication of an $m \times n$ Toeplitz
matrix (whose diagonals are given by $s$) by the vector $x$. Alternatively, it can
be considered a convolution of $x$ and $s$. For $\mathbb{F} = GF(2)$, $n' = n$, $m' = m$,
$h_s(x)$ yields the following extractor for any $m < n$ (as before, $t$ and $\epsilon$ follow from
Corollary 1).

— Preprocessing: Choose a string $\pi \in \{0,1\}^{n+m-1}$ at random.

— Runtime: $m$ bits such that the $i$-th bit is $\bigoplus_{j=0}^{n-1} (x_j \wedge s_{i+j})$.
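
The two phases above can be sketched bit by bit in Python. This is an unoptimized illustration with hypothetical names, not the C implementation referred to in this section; it assumes zero-based indices $i = 0, \ldots, m-1$ and $j = 0, \ldots, n-1$, and a public parameter of $n + m - 1$ bits.

```python
import secrets


def choose_public_parameter(n, m):
    """Preprocessing: n + m - 1 uniformly random bits."""
    return [secrets.randbits(1) for _ in range(n + m - 1)]


def toeplitz_extract(x_bits, s_bits, m):
    """Runtime: output bit i is the XOR over j of (x_j AND s_{i+j})."""
    n = len(x_bits)
    if len(s_bits) != n + m - 1:
        raise ValueError("public parameter must have n + m - 1 bits")
    out = []
    for i in range(m):
        bit = 0
        for j in range(n):
            bit ^= x_bits[j] & s_bits[i + j]
        out.append(bit)
    return out


# Worked example with n = 3, m = 2:
#   bit 0 = x0&s0 ^ x1&s1 ^ x2&s2 = 1 ^ 0 ^ 0 = 1
#   bit 1 = x0&s1 ^ x1&s2 ^ x2&s3 = 1 ^ 0 ^ 1 = 0
example = toeplitz_extract([1, 0, 1], [1, 1, 0, 1], 2)
```

A practical version would pack bits into machine words and replace the inner loop by a word-wide AND followed by a parity computation; the loop form is kept here because it mirrors the formula directly.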

For reasonable n and m, this can be implemented very efficiently in hardware, 
though in software the bit operations are somewhat inconvenient. To evaluate 
this, we tested a software implementation of this construction, written in plain 
C. For the parameters n = 768, m = 256 of Section 2.5, this implementation 
had a throughput of 36Mbit/sec (measured at the input) when executed on a
1.7GHz Pentium Xeon processor (cf. [Web]).

4.3 Toeplitz Matrices over $GF(2^k)$

Since Theorem 4 applies to any finite field, we may benefit from using $GF(2^k)$
for $k > 2$. For example, consider $F = F'[x]/(x^2 + rx + 1) \cong GF(2^{16})$ where
$F' = GF(2)[x']/({x'}^8 + {x'}^4 + {x'}^3 + x' + 1) \cong GF(2^8)$ and $r = x' \in F'$ (both modulus
polynomials are irreducible over the respective fields; the latter is taken from
the Rijndael cipher). Using this field, a direct implementation of the convolution
performs just n/16 field multiplications for every 16 output bits.

The fields $F$, $F'$ are very suitable for software implementation, as follows. For
$a, b \in F' \setminus \{0\}$, multiplication in $F'$ can be realized as $ab = \exp_z(\log_z a + \log_z b)$,
where $z$ is some generator of $F'$; $\exp_z$ and $\log_z$ can be implemented via two small
lookup tables [Gla02]. Multiplication in $F$ can then be realized by 5 multiplications
in $F'$:

$$(a_1 x + a_0) \cdot (b_1 x + b_0) = a_0 b_0 - c + (a_0 b_1 + a_1 b_0 - rc)x \quad \text{over } F$$

where $c = a_1 b_1$ and $r = x' \in F'$ is a constant ($02_{16}$ as a bit vector). All additions
can be done as bitwise XOR, so overall each multiplication over $F$ requires 4
XOR operations, 5 integer additions, a few shifts, 5 · 3 table lookups (into the L1
cache) and 5 · 2 tests whether $a, b = 0$.^8 A straightforward C implementation of
the above description achieved a throughput of 56Mbit/sec in the same settings
as above (cf. [Web]).
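
The table-based arithmetic described above can be sketched as follows. This is an illustration only, with hypothetical names, and is not the C code at [Web]. It assumes the element 0x03 as the generator of the multiplicative group of $F'$ (a common choice for the Rijndael field) and stores an element $a_1 x + a_0$ of $F$ as the 16-bit integer with $a_1$ in the high byte.

```python
def build_tables():
    """Build exp/log tables for GF(2^8) with modulus 0x11B and generator 0x03.

    The exp table is doubled in length so that log a + log b can be used
    as an index without reduction modulo 255.
    """
    exp = [0] * 510
    log = [0] * 256
    value = 1
    for power in range(255):
        exp[power] = value
        log[value] = power
        doubled = value << 1          # multiply by x
        if doubled & 0x100:
            doubled ^= 0x11B          # reduce modulo the Rijndael polynomial
        value = doubled ^ value       # multiply by x + 1, i.e. by 0x03
    for power in range(255, 510):
        exp[power] = exp[power - 255]
    return exp, log


EXP, LOG = build_tables()
R = 0x02  # the constant r = x' in F'


def mul8(a, b):
    """Multiplication in F' = GF(2^8) via the lookup tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def mul16(a, b):
    """Multiplication in F = F'[x]/(x^2 + r x + 1) using 5 products in F'."""
    a1, a0 = a >> 8, a & 0xFF
    b1, b0 = b >> 8, b & 0xFF
    c = mul8(a1, b1)
    low = mul8(a0, b0) ^ c                            # a0 b0 - c
    high = mul8(a0, b1) ^ mul8(a1, b0) ^ mul8(R, c)   # a0 b1 + a1 b0 - r c
    return (high << 8) | low
```

Since subtraction coincides with addition in characteristic 2, every minus sign in the formula becomes an XOR. As footnote 8 suggests, the product of $c$ by the constant $r$ could be replaced by a single dedicated table lookup.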

4.4 Randomness Tests 

We subjected the above implementations to the DIEHARD suite of statistical
tests [Mar95]. For the seed, we used 1023 truly random bits generated by the
/dev/random TRNG of Linux 2.4.20. For the source, we generated a 90MB file
of English text by retrieving a large number of e-texts from Project
Gutenberg, discarding the first 1000 lines of each file (this contains a common header)
and eliminating all whitespaces. Thus the source data included the texts of Moby
Dick and Frankenstein, the complete works of Shakespeare, and other such
“random” data. We then executed the extractor on successive blocks of $n$ bits, with
$n = 768$ and $m = 256$, to get 30MB of output bits. The DIEHARD tests did not
detect any anomaly in this output.^9 The test report is available at [Web].

5 Conclusions 

In this work, we provide an extractor function that is proven to work in a model 
that allows for some adversarial influence on the high-entropy source. The most 
obvious question is whether the real world conditions satisfy the assumptions 

^8 The multiplication of c by the constant r can be computed by a single lookup; this
eliminates 1 addition, 2 lookups and 2 tests. 

^9 In fact, on first attempt DIEHARD did report certain anomalies in one test. A
careful inspection revealed that our source data accidentally included several nearly- 
identical versions of some literary works. 







B. Barak, R. Shaltiel, and E. Tromer 



of our model. For example, suppose that a manufacturer constructs a device 
that outputs a distribution with min-entropy at least k in the “benign” (i.e.,
non-adversarial) settings. Suppose now that he wants to apply our extractor to 
the output of this device, in an environment that may be somewhat influenced 
by the adversary. 

One concern the manufacturer may have is how to ensure that under all pos- 
sible adversarial influences, the entropy of the source will remain sufficient? This 
is indeed a valid concern, but if this is not met then the result will be insecure re- 
gardless of the extractor function used (since the adversary will be able to reduce 
the source’s entropy, and no extractor function can add entropy). Therefore, it 
is the responsibility of the manufacturer to make sure that the device satisfies 
this condition. When this is fulfilled, our construction gives explicit guarantees 
on the quality of the extractor’s output. 

We stress that while the manufacturer still needs to carefully design the high- 
entropy source to be as independent as possible from environmental influence, 
the overall scheme will work even if the design is not perfect and the adversary can
affect the source in unpredictable ways, subject to the constraints assumed in 
our security model. 

Acknowledgements. We thank Moni Naor and Adi Shamir for helpful
discussions, and the anonymous referees for their constructive comments.

A Proof of the Main Theorems



We begin by showing that if $H = \{h_s\}_{s \in S}$ is an $\ell$-wise independent family of
hash functions for sufficiently large $\ell$, then for any fixed distribution $X$ with
sufficiently large min-Ent($X$), for most choices of $s \in S$, $h_s(X)$ is close to the
uniform distribution. The interpretation is that for most choices of $s$, $h_s$ is a
good randomness extractor for $X$. This is formally stated in the next lemma.

Lemma 1. Let $X$ be an $n$-bit random variable with min-Ent($X$) $\ge k$. Let $H =
\{h_s\}_{s \in S}$ be a family of $\ell$-wise independent hash functions from $n$ bits to $m$ bits,
$\ell \ge 2$. For at least a $1 - 2^{-u}$ fraction of $s \in S$, $h_s(X)$ is $\epsilon$-close to uniform for

$$u = \frac{\ell}{2}\left(k - m - 2\log(1/\epsilon) - \log \ell + 2\right) - m - 2$$

The proof uses standard arguments on $\ell$-wise independent hash functions
(this technique was used in a very related context in [TV02]). We will need the
following tail inequality for $\ell$-wise independent distributions.

Theorem 5. [BR94] Let $A_1, \ldots, A_n$ be $\ell$-wise independent random
variables in the interval $[0, 1]$. Let $A = \sum_i A_i$, $\mu = E(A)$ and
$\delta < 1$. Then,

$$\Pr[|A - \mu| \ge \delta\mu] \le c_\ell \left(\frac{\ell}{\delta^2 \mu}\right)^{\lfloor \ell/2 \rfloor}$$

where $c_\ell < 3$ and $c_\ell < 1$ for $\ell \ge 8$.

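Proof (of Lemma 1). For $x \in \{0,1\}^n$ let $p_x = \Pr[X = x]$. We consider
the probability space of choosing $s \in_R S$. For every $x \in \{0,1\}^n$ and
$y \in \{0,1\}^m$ we define the following random variable:

$$Z_{x,y} = \begin{cases} p_x & h_s(x) = y \\ 0 & \text{otherwise} \end{cases}$$

We also define $A_{x,y} = Z_{x,y} 2^{k}$. Recall that for every $x$, $h_s(x)$ is
uniformly distributed, and therefore for every $x, y$,
$E(Z_{x,y}) = p_x 2^{-m}$. Let $Z_y = \sum_{x \in \{0,1\}^n} Z_{x,y}$ and
$A_y = \sum_{x \in \{0,1\}^n} A_{x,y}$. It follows that for every
$y \in \{0,1\}^m$, $E(Z_y) = 2^{-m}$ and thus $E(A_y) = 2^{k-m}$. Note that for
every $x, y$, $A_{x,y}$ lies in the interval $[0, 1]$ (because
min-Ent$(X) \ge k$ means $p_x \le 2^{-k}$) and that for every $y$, the variables
$A_{x,y}$ are $\ell$-wise independent. Applying Theorem 5 we obtain that for
every $y$ and $\delta < 1$

$$\Pr\left[|A_y - 2^{k-m}| \ge \delta 2^{k-m}\right] \le c_\ell \left(\frac{\ell}{\delta^2 2^{k-m}}\right)^{\lfloor \ell/2 \rfloor}$$

Substituting $Z_y$ for $A_y$ and choosing $\delta = 2\epsilon$, we get that for
every $\epsilon < 1/2$

$$\Pr\left[|Z_y - 2^{-m}| \ge 2\epsilon 2^{-m}\right] \le c_\ell \left(\frac{\ell}{4\epsilon^2 2^{k-m}}\right)^{\lfloor \ell/2 \rfloor}$$

By a union bound, it follows that with probability
$1 - 2^m c_\ell \left(\frac{\ell}{4\epsilon^2 2^{k-m}}\right)^{\lfloor \ell/2 \rfloor} \ge 1 - 2^{-u}$
over $s \in_R S$, for all $y \in \{0,1\}^m$,
$|Z_y - 2^{-m}| < 2\epsilon 2^{-m}$. We now argue that for such $s$, $h_s(X)$ is
$\epsilon$-close to uniform. Observe that $Z_y$ is the probability that
$h_s(X) = y$ (we think of $s$ as fixed, with $x$ chosen according to $X$). The
statistical distance between $h_s(X)$ and the uniform distribution is given by:

$$\frac{1}{2} \sum_{y \in \{0,1\}^m} |Z_y - 2^{-m}| < \frac{1}{2} \sum_{y \in \{0,1\}^m} 2\epsilon 2^{-m} = \epsilon$$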



When applying Lemma 1 with $\ell = 2$, one must set $m < k/2$. This can be
avoided as shown by the following lemma, for the special case $\ell = 2$.
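The restriction can be made explicit by specializing the bound of Lemma 1 to
$\ell = 2$. The following short check assumes logarithms to base 2, so that
$\log \ell = 1$:

$$u = \frac{2}{2}\left(k - m - 2\log(1/\epsilon) - 1 + 2\right) - m - 2 = k - 2m - 2\log(1/\epsilon) - 1$$

A positive $u$ therefore requires $2m < k - 2\log(1/\epsilon) - 1$, and in
particular $m < k/2$.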

Lemma 2. Let $X$ be an $n$-bit random variable with min-Ent$(X) \ge k$. Let
$H = \{h_s\}_{s \in S}$ be a family of 2-wise independent hash functions from
$n$ bits to $m$ bits. For at least a $1 - 2^{-u}$ fraction of $s \in S$,
$h_s(X)$ is $\epsilon$-close to uniform for

$$u = \frac{k - m}{2} - \log(1/\epsilon) - 1$$

The proof is based on a technique introduced in [ILL89], and will appear in 
the full version of this paper. The next corollary follows easily, by a union bound. 

Corollary 1. Let $X_1, \ldots, X_{2^t}$ be random variables with values in
$\{0,1\}^n$ such that for each $1 \le i \le 2^t$, min-Ent$(X_i) \ge k$. Let $H$
and $u$ be as in Lemma 1 (or Lemma 2). For at least a $1 - 2^{t-u}$ fraction of
$s \in S$ it holds that for all $i$, $h_s(X_i)$ is $\epsilon$-close to uniform.

Theorems 1 and 2 follow by setting $u = \log(1/\epsilon)$.

Abstract. A hardware random number generator was described at
CHES 2002 in [Tka03]. In this paper, we analyze its method of generat- 
ing randomness and, as a consequence of the analysis, we describe how, 
in principle, an attack on the generator can be executed.

1 Introduction

Neither designs for hardware random number generators nor their evaluation
have been treated often in publications. There is a serious contrast between,
on the one hand, the importance of hardware random number generation and its
evaluation for the security of applications and, on the other hand, the little
attention this topic has found in the published literature.

One case where the absence of a suitable physical random number generator
received considerable public attention was the defect found in the random
number generation for SSL implemented in an early version of the Netscape
browser. The time of day, used as the only source of true randomness, did
not provide enough entropy. This lack of entropy could be used for a spectacular
attack [GW96].

Physical random number generators, which derive their randomness from a
physical random process, are also called true random number generators (TRNGs).
TRNGs have to be distinguished from pseudo random number generators
(PRNGs). PRNGs derive their output algorithmically from a secret initial state.
The unpredictability of PRNGs relies on the computational infeasibility of
trying all possible initial states, and on some assumptions on the algorithm
used.

We will see that the hardware random number generator described at CHES
2002 in [Tka03] is a combination of TRNG and PRNG elements. Therefore we
just call it the RNG (random number generator) in this paper.

We will show that the RNG in some cases produces very little entropy, so 
that its output can be predicted. This is in contrast with one of the design 
requirements for the generator cited in [Tka03]. 

This paper provides a theoretical analysis based on the properties of the RNG
described in [Tka03]; no experiments on a real chip were made.

2 A Hardware Random Number Generator

The RNG described in [Tka03] uses two free running oscillators, implemented 
as ring oscillators, to clock two deterministic finite state machines. 

One of the finite state machines is a binary linear feedback shift register
(LFSR) of length 43. The feedback polynomial of the LFSR is primitive.

The other finite state machine is a one dimensional binary cellular automaton 
(CA) with a neighbourhood of 3. The CA consists of 37 cells. All cells except 
one follow one rule to derive their next state: The new value is the XOR of the 
old values of the two neighbouring cells. Only cell 28 is subject to another, albeit 
similar, rule: Its new value is the XOR sum of its old value and the old values 
of its two neighbours. For the two cells at the border of the CA, null boundary 
conditions are used; that is those neighbouring cells which are required by the 
CA rules, but which are beyond the limits of the CA, are assumed to have the 
fixed value 0. If started from a state that is not all zeros, this CA has a
cycle length of $2^{37} - 1$.

When the RNG has to produce a random number, 32 bits of the LFSR and
32 bits of the CA are taken from fixed positions of those finite state machines.
The 32 bits from the LFSR are XORed with the 32 bits from the CA in order
to obtain the 32 bit output word.
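The structure just described can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the feedback polynomial, the
output bit positions and the cell numbering of [Tka03] are not given here, so
the taps in `LFSR_TAPS`, the use of the first 32 cells as output, and the
1-based reading of "cell 28" are placeholders rather than the actual design.
The LFSR length of 43 is the one implied by the 86 clocks ("twice its length")
and the 80 state bits mentioned later in the paper.

```python
N_LFSR, N_CA, WORD = 43, 37, 32

# Hypothetical tap positions, for illustration only; they are NOT claimed to be
# the (primitive) feedback polynomial of [Tka03].
LFSR_TAPS = (0, 1, 20, 41)
# "Cell 28" read with 1-based numbering (assumption) -> index 27.
CA_SPECIAL = 27


def lfsr_step(state):
    """Clock the LFSR once: shift by one cell, feed back the XOR of the taps."""
    feedback = 0
    for t in LFSR_TAPS:
        feedback ^= state[t]
    return state[1:] + [feedback]


def ca_step(state):
    """Clock the CA once with null boundary conditions.

    Every cell becomes the XOR of its two neighbours; cell CA_SPECIAL
    additionally XORs in its own old value.
    """
    n = len(state)
    new = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        own = state[i] if i == CA_SPECIAL else 0
        new.append(left ^ right ^ own)
    return new


def output_word(lfsr, ca):
    """XOR 32 LFSR bits with 32 CA bits (first 32 cells of each: assumption)."""
    return [lfsr[j] ^ ca[j] for j in range(WORD)]


def xor_vec(a, b):
    """Bitwise XOR of two equally long bit lists."""
    return [x ^ y for x, y in zip(a, b)]


# Both update rules use XOR only, so they are linear over GF(2):
# step(a XOR b) == step(a) XOR step(b).
a = [(i * 7 + 3) % 5 % 2 for i in range(N_CA)]
b = [(i * i + 1) % 3 % 2 for i in range(N_CA)]
ca_is_linear = ca_step(xor_vec(a, b)) == xor_vec(ca_step(a), ca_step(b))

u = [(i * 5 + 1) % 7 % 2 for i in range(N_LFSR)]
v = [(i * 3 + 2) % 4 % 2 for i in range(N_LFSR)]
lfsr_is_linear = lfsr_step(xor_vec(u, v)) == xor_vec(lfsr_step(u), lfsr_step(v))
```

The linearity check at the end holds for any choice of taps and of the special
cell, because both update rules consist of XOR operations only.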



3 Where Does the Randomness in the RNG Come From? 

The main elements of the RNG are two free running oscillators, a linear feedback 
shift register, and a cellular automaton. Free running oscillators can be a basis 
of TRNGs, whereas linear feedback shift registers and cellular automata, which 
are deterministic, are frequently used in the construction of PRNGs. In order to 
get a clear understanding of the origin of randomness in the RNG, the TRNG 
parts and the PRNG parts have to be separated mentally. 

The linear feedback shift register used in the RNG can be seen as a special
counter with a period length of $2^{43} - 1$. The counter states are not
represented as the familiar binary numbers, but are encoded as subsequent shift
register states. Clearly the conversion between the representation of a counter
state as a binary number and a shift register state is completely deterministic.

Analogously, the cellular automaton used in the RNG can be seen as a special
counter with a period length of $2^{37} - 1$. Here, the counter state is
represented as a cellular automaton state, but again, the conversion between
these cellular automaton states and the familiar binary numbers is completely
deterministic.

Hence, the only sources of entropy in the RNG are the initial states of the
registers in the linear feedback shift register and the cellular automaton, and
the numbers of clocks that occurred for the linear feedback shift register and
for the cellular automaton. The number of clocks for the linear feedback shift
register is only relevant modulo $2^{43} - 1$, the number of clocks for the
cellular automaton modulo $2^{37} - 1$.

4 How Random Is This?

As we have identified the sources of entropy in the RNG, the question arises how 
much entropy they provide. 

The first source of randomness is the undetermined initial states of the
registers. Even if each register assumed the states 0 and 1 with probability 1/2
each time the RNG is initialized, and even if there were no dependencies between
the states of the registers at different initializations, this would not help
the RNG much in the long run, because a good TRNG must continually produce
new entropy as it runs, and not rely on an initial stock of entropy. Moreover,
the initial states of flip flops turn out to be no reliable source of entropy.
Due to manufacturing variations, flip flops are not completely symmetrical and,
as a consequence, most flip flops have a preferred initial state, which they
assume with a probability close to 1.

The other sources of randomness are the numbers of clocks occurring for the
cellular automaton (modulo $2^{37} - 1$) and for the linear feedback shift
register (modulo $2^{43} - 1$) since the initialization.

When an attacker Alice knows the frequencies of the free running oscillators
clocking the LFSR and the CA only with limited precision, the RNG becomes
completely unpredictable after a sufficiently long waiting time.

Let us assume Alice knows the frequencies of the free running oscillators with
a precision of 10 percent, and that she also knows the initial state of the CA.
Roughly speaking, she loses all information about the state of the CA after
about $10 \cdot (2^{37} - 1) \approx 10^{12}$ clocks of the CA. Even if the CA
were clocked with 1 GHz, this would mean a waiting time of about 22 minutes. And
in order to achieve completely independent CA states, one would also need
waiting times of about 22 minutes between each 32 bit block of random values
generated.
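The figures in this paragraph can be checked with a short calculation. It
assumes that a relative frequency error of 10 percent translates into an
uncertainty of $0.1\,N$ clocks after $N$ clocks, and that all information is
lost once this uncertainty covers the full CA period:

$$0.1\,N \ge 2^{37} - 1 \;\Longrightarrow\; N \ge 10 \cdot (2^{37} - 1) \approx 1.37 \times 10^{12}$$

$$\frac{1.37 \times 10^{12}\ \text{clocks}}{10^{9}\ \text{clocks/s}} \approx 1374\ \text{s} \approx 22.9\ \text{minutes}$$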

In [Tka03], too, a minimum sampling period for the subsequent generation
of 32 bit blocks of random values is given. It is considerably smaller than
$10^{12}$ clocks, namely 86 cycles of the oscillator clocking the LFSR.
Subsequently, we shall study the predictability of the RNG when this minimum
sampling period is used.



5 How to Predict the RNG Bits 

5.1 How Well Does an Attacker Know the Frequencies of the Free 
Running Clocking Oscillators? 

Evidently, the better the attacker Alice knows the frequencies of the free
running oscillators clocking the LFSR and the CA, the better she can predict the
numbers of clocks occurring for the LFSR and the CA.

The knowledge of these frequencies depends heavily on the circumstances
of the attack. The main environmental parameters influencing the frequencies
of free running oscillators are the temperature and the supply voltage.
Sometimes these parameters are difficult for Alice to predict. In other
applications, she may know these parameters precisely, or may even choose them.
For example, professionally run trust centers tend to have their computers in
stable air conditioned environments without much variation in temperature or
supply voltage. On the other hand, smartcards will encounter enormous variations
in environmental conditions, but when the user of a smartcard wants to attack
the physical random number generator of the smartcard, she may choose the
ambient temperature and the supply voltage at will.

But even when Alice knows the operating conditions of the oscillators 
perfectly, their frequencies cannot be predicted perfectly, because of non- 
deterministic effects in the oscillators. For example, there are physically un- 
avoidable noise voltages in the transistors of the oscillator. This noise influences 
to some degree the exact moments when the transistors switch. 

In order to infer the clocking frequencies from the environmental data, the 
attacker can either perform experiments on a chip with the hardware random 
number generator, or she must know the design details of the oscillators. The 
manufacturer of the chip of course knows these design details, and the outcome 
of a good hardware random number generator should be unpredictable even for 
the manufacturer of the hardware. 

We will not elaborate a statistical model for the attacker's knowledge of the
clocking frequencies, because this is not crucial for the attack. As we will see
later on, we can easily increase the number of tries to guess the number of
clocks occurring for the LFSR and the CA if our assumptions about the knowledge
of the clocking frequencies are wrong.

In order to make the attack as efficient as possible, we concentrate on the 
case where the RNG is subsequently sampled as fast as possible. In [Tka03], 
the minimum time between the sampling of two output words is defined by the 
requirement that both state machines (CA and LFSR) clock at least twice their 
length. 

For our attack we also need to assume an upper bound on the ratio of the
frequency of the faster free running oscillator to that of the slower
oscillator. This is not an arbitrary restriction: performance and power
consumption considerations make it advisable to choose the clocking frequencies
of both oscillators in the same order of magnitude. If a very fast oscillator is
used for one finite state machine and a slow oscillator for the other, one gets
an RNG with a low data rate but high power consumption. The low data rate is
caused by the slow oscillator and the design rule that the finite state machine
must be clocked twice its length before it can be sampled again. The high power
consumption is due to the fast oscillator, because power consumption and
clocking frequency of the state machine are roughly proportional. Subsequently,
we assume an upper bound of 3 for the frequency ratio of the oscillators. If the
frequency ratio were higher, more guesses would be needed to find the correct
number of clocks. A smaller bound would speed up the attack.

In the scenario where the attacker defines the environmental conditions, she
should be able to know the clock frequencies with a precision of 1 percent. If
the attacker does not control the environmental conditions, she might be able
to determine the clocking frequencies with a precision of 10 percent.

5.2 Guessing the Number of Clocks

Subsequently, we will consider three 32 bit words sampled from the RNG at
top speed. This means that we have to consider the number of clocks occurring
between the first and second sample, and between the second and third sample,
for each oscillator. We use $f_{CA}$ and $f_{LFSR}$ to denote the clock
frequencies of the CA and the LFSR, respectively.



Case A. Here we consider the case $f_{LFSR} \le f_{CA}$. In this case, the
maximum sampling frequency is limited by the rule that the LFSR must clock twice
its length before it can be sampled again. This means that at top speed the LFSR
clocks 86 times, or, since Alice knows the frequency only with a precision of
one percent in the scenario of an environment controlled by her, there may also
occur 85 or 87 clocks. By our bound of 3 on the frequency ratio between the
oscillators, the number of CA clocks is bounded by 258. With an error of one
percent in Alice's knowledge of the frequency, this leads to at most 7 possible
numbers of CA clocks. Analogously, with a 10 percent uncertainty for the
frequencies, there are 19 possibilities for the number of LFSR clocks, and at
most 53 for the number of CA clocks.



Case B. Here we consider the case $f_{LFSR} > f_{CA}$. We have to distinguish
two subcases.



Case B1. When $37 f_{LFSR} \le 43 f_{CA}$ holds, that is, $f_{LFSR}$ is only
slightly larger than $f_{CA}$, the maximum sampling rate allowed for the RNG is
still determined by the LFSR frequency. As in case A, we have 3 possibilities
for the number of LFSR clocks, if we know the frequencies with a precision of
1 percent. Since the CA is clocked at a lower rate, at most 3 numbers of CA
clocks are possible. For the scenario of a 10 percent precision in the knowledge
of the frequencies, we get 19 possible numbers of LFSR clocks and also 19
possible numbers of CA clocks.



Case B2. When $37 f_{LFSR} > 43 f_{CA}$ holds, the maximum sampling rate is
determined by the CA. If the attacker knows the frequencies with a precision
of 1 percent, this leads to 3 possible numbers of CA clocks, and to at most 7
possible numbers of LFSR clocks. With a 10 percent accuracy in the frequencies,
there are 19 possible numbers of CA clocks, and 46 possible numbers of LFSR
clocks.

In the case of frequencies known with a precision of 1 percent, the worst
case is that we have a total of 21 possibilities for the numbers of clocks for
both finite state machines. With a precision of 10 percent, the worst case is
1007 possibilities.

Since we need the numbers of clocks occurring between the first and second
sample, and between the second and third sample, we get a total of 441 cases
(1 percent case) or 1014049 (10 percent case). These numbers are just a very
coarse upper bound on the number of cases to consider, because the numbers
of clocks between the different samples are strongly dependent. If, for example,
the attacker knows that the number of LFSR clocks is between 200 and 240, she
should not begin with 200 for the number of clocks between the first and second
sample, and 240 for the number of clocks between the second and third sample.
This combination is quite improbable, because the frequencies of the
oscillators do not change suddenly from very low to very high. Instead, the best
strategy for the attacker is to assume that the clock frequency changed only
very little from the second to the third sample. So, combinations of numbers of
clocks with small differences should be tried first.
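The worst-case counts of this section can be reproduced with a small Python
sketch. The helper `window` is a hypothetical name, and its rounding rule (the
tolerance is rounded up to a whole clock on each side of the nominal value) is
an assumption chosen because it reproduces the Case A figures of 3, 7, 19 and
53; the text does not state the rule explicitly.

```python
def window(nominal_clocks, percent):
    """Number of integer clock counts within +-percent of the nominal value.

    The tolerance is rounded up to a whole clock on each side (assumption).
    """
    tol = -(-nominal_clocks * percent // 100)   # ceil(nominal * percent / 100)
    return 2 * tol + 1


LFSR_CLOCKS = 2 * 43              # the LFSR is clocked twice its length
CA_CLOCKS_MAX = 3 * LFSR_CLOCKS   # frequency ratio bounded by 3 (Case A)

per_interval = {}
for percent in (1, 10):
    n_lfsr = window(LFSR_CLOCKS, percent)
    n_ca = window(CA_CLOCKS_MAX, percent)
    per_interval[percent] = n_lfsr * n_ca

# Two sampling intervals have to be guessed, so the coarse bound is squared.
total = {p: n * n for p, n in per_interval.items()}
print(per_interval, total)
```

Ordering the candidate pairs by the difference between the two intervals, as
suggested above, would be a refinement of this exhaustive count.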

5.3 Determining the Internal States of the CA and the LFSR 

In this section we assume that we have correctly guessed the number of clocks 
of both the CA and the LFSR occurring between three top speed samplings of 
the RNG. We try to find out the internal states of the machines from the three 
32 bit output words. 

Since we assume that we know the number of clocks occurring, we could
try a brute force approach. The almost $2^{80}$ possible initial states make
this quite impractical. An efficient solution must rely on the properties of the
state machines.

A closer inspection of the two finite state machines makes the solution very
easy: both are linear over GF(2). The function combining bits from each finite
state machine to compute the RNG output is also linear. We have to solve a
system of 96 linear equations in order to determine the 80 bits of the states of
the CA and the LFSR.

The fact that the number of equations exceeds the number of variables by
16 helps to eliminate wrong guesses of the number of clocks of the finite state
machines. With a probability of $1 - 1/2^{16}$, a wrong guess results in a
system of linear equations without a solution.
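The state recovery can be sketched in Python as follows. Everything specific
in this sketch is an assumption made for illustration: the LFSR taps, the
special CA cell index, the use of the first 32 cells of each machine as output
bits, and the clock counts in `clocks` are hypothetical and not taken from
[Tka03]. The idea is to run the two machines symbolically, with every cell
holding a bitmask that records which of the 80 unknown initial bits are XORed
into it, and then to solve the resulting 96 equations by Gaussian elimination
over GF(2).

```python
import random

N_LFSR, N_CA, WORD = 43, 37, 32
LFSR_TAPS = (0, 1, 20, 41)   # hypothetical taps, not those of [Tka03]
CA_SPECIAL = 27              # "cell 28", assuming 1-based numbering


def lfsr_step(s):
    fb = 0
    for t in LFSR_TAPS:
        fb ^= s[t]
    return s[1:] + [fb]


def ca_step(s):
    n = len(s)
    return [(s[i - 1] if i > 0 else 0) ^ (s[i + 1] if i < n - 1 else 0)
            ^ (s[i] if i == CA_SPECIAL else 0) for i in range(n)]


def samples(lfsr, ca, clocks):
    """Output words at the start and after each (lfsr_clocks, ca_clocks) pair.

    Cells may hold bits, or integer bitmasks that stand for GF(2)-linear
    combinations of the unknown initial bits; XOR works for both.
    """
    words = [[lfsr[j] ^ ca[j] for j in range(WORD)]]
    for n_lfsr, n_ca in clocks:
        for _ in range(n_lfsr):
            lfsr = lfsr_step(lfsr)
        for _ in range(n_ca):
            ca = ca_step(ca)
        words.append([lfsr[j] ^ ca[j] for j in range(WORD)])
    return words


def solve_gf2(rows):
    """Gaussian elimination over GF(2) on (mask, rhs) rows.

    Returns one solution as an integer bitmask (free variables set to 0),
    or None if the system is inconsistent.
    """
    basis = {}                       # pivot bit -> (mask, rhs)
    for mask, rhs in rows:
        while mask:
            p = mask.bit_length() - 1
            if p not in basis:
                basis[p] = (mask, rhs)
                break
            mask ^= basis[p][0]
            rhs ^= basis[p][1]
        else:
            if rhs:                  # row reduced to 0 = 1
                return None
    x = 0
    for p in sorted(basis):          # back-substitute from the lowest pivot
        mask, rhs = basis[p]
        if rhs ^ (bin(mask & x).count("1") & 1):
            x |= 1 << p
    return x


rng = random.Random(2003)
secret_lfsr = [rng.randrange(2) for _ in range(N_LFSR)]
secret_ca = [rng.randrange(2) for _ in range(N_CA)]
clocks = [(86, 90), (87, 91)]        # assumed to have been guessed correctly
observed = samples(secret_lfsr, secret_ca, clocks)

# Symbolic run: unknown number i is represented by the bitmask 1 << i.
sym = samples([1 << i for i in range(N_LFSR)],
              [1 << (N_LFSR + i) for i in range(N_CA)], clocks)
rows = [(m, b) for ms, bs in zip(sym, observed) for m, b in zip(ms, bs)]

x = solve_gf2(rows)
found_lfsr = [(x >> i) & 1 for i in range(N_LFSR)]
found_ca = [(x >> (N_LFSR + i)) & 1 for i in range(N_CA)]
```

The state found this way reproduces all three observed words. If the 96
equations have full rank 80, it is also the only such state; for a wrong guess
of the clock counts, `solve_gf2` would typically return `None`.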

When one tries to write down the linear equations, one encounters a minor
problem: [Tka03] does not specify which 32 bits from each finite state machine
are used and how they are permuted. An attacker could reverse engineer the
chip in order to obtain this information. The information is also known to the
manufacturer of the chip. And, as already mentioned above, the output of a
good RNG should be unpredictable even for the manufacturer of the chip. In
our further analysis, we assume that the attacker knows which bits of the finite
state machines are used for the output, and how they are permuted.

To determine the time required to solve a system of equations as
described above, fixed random choices of bits and fixed random permutations
were used. Clearly these choices do not have an essential influence on the
complexity of solving the system of linear equations.

On a 400 MHz Pentium II, Mathematica 4.2 solved the system of equations
in 0.06 seconds using the function LinearSolve[]. This time can definitely be
improved significantly by using a faster PC or dedicated software for solving
systems of linear equations over GF(2). But even when it takes 0.06 seconds to
solve the system of linear equations, in the scenario of clock frequencies known
with a precision of 1 percent, all 441 possible systems can be solved in 27
seconds. In the 10 percent scenario, it takes 17 hours to try all 1014049
possibilities. But as pointed out above, many combinations of numbers of clocks
are quite improbable, so a good strategy for ordering the tries will enable the
attacker to find the solution much faster. If the attacker tries all 1014049
possibilities, she will find about 15 solutions not corresponding to the
internal states of the finite state machines. The reason is that wrong guesses
lead to a solvable system of linear equations with a probability of $1/2^{16}$.
The attacker should prefer solutions for which the differences in the number of
clocks are small.
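These totals follow directly from the measured 0.06 seconds per system and the
case counts of Section 5.2, as the following check shows:

$$441 \times 0.06\ \text{s} \approx 26.5\ \text{s}, \qquad 1014049 \times 0.06\ \text{s} \approx 60843\ \text{s} \approx 16.9\ \text{h}$$

$$\frac{1014049}{2^{16}} \approx 15.5$$

The last value is the expected number of wrong clock guesses that still lead
to a solvable system.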

5.4 Predicting Bits 

Once the attacker knows the internal states of the finite state machines, she
is well off. In order to predict the next output bits, she just has to guess the
numbers of clocks of the finite state machines until the time the next random
sample was generated. We have seen above that the number of cases to consider is
quite small. But now the task of finding the right number of clocks is easier in
two ways, compared to finding the correct number of clocks to determine the
state. To be able to get the equations, Alice had to guess the right number of
clocks for two samples. Here the number of clocks for one sample is sufficient.
Alice can also profit from knowledge acquired when finding out the internal
states of the finite state machines. She may have started with little knowledge
of the oscillator frequencies, but now she knows them with high precision,
because she knows for which numbers of clocks the system of linear equations
could be solved. This good knowledge of the oscillator frequencies leads to very
few possibilities for the numbers of clocks for the finite state machines. Alice
applies these numbers of clocks to a simulation of the finite state machines in
order to compute the next output of the RNG.

6 Is the Described Attack Practically Relevant? 

The attack described above enables an attacker to predict output bits from the 
RNG after having seen some earlier output bits. The question is whether there 
are practical security applications where such an attack could be applied. 

One straightforward application of cryptographic RNGs is the generation
of keys for symmetric cryptography. When a number of keys are generated in
succession for different users, the recipient Alice of a key could find out the
key generated for the next user by applying the technique described above.
Today, symmetric keys usually have 128 bits or more, so Alice can use her own
key to determine the state of the RNG and only has to try a very small number of
possible keys for the next user. Of course she does not have to stop there; she
can continue with the next user but one, and so on.

In the scenario just described, the attacker had to participate actively in a
protocol in order to get her own key, from which she could derive the keys of




M. Dichtl 



other users. Can the attack of section 5 also be used by a passive attacker? For
such an attack we need a protocol which generates and communicates random
numbers in plaintext, and subsequently uses the RNG to generate a secret. This
situation turns out to occur very often, namely when a random challenge is
generated for challenge and response authentication and a session key is
generated subsequently.

7 Conclusion 

We showed that the random number generator described in [Tka03] is a com- 
bination of TRNG and PRNG elements. The TRNG elements produce little 
entropy when the random number generator is sampled at top rates. The output 
of the device can be predicted by taking into account both the small amount of 
entropy generated and the linearity of the PRNG elements. 

How can these problems be overcome? Obviously by strengthening the TRNG
elements and/or the PRNG elements.

The problem with the TRNG elements is that at top sampling rates the
amount of state information the RNG outputs largely exceeds the amount of
entropy they generate. This can be cured by sampling less frequently, or by
sampling fewer bits each time. We have seen in section 4 that the required
reduction of the sampling frequency is rather impractical. Sampling fewer bits
each time the RNG is invoked is more efficient. For example, if only one bit of
output is generated in each output of the RNG, the data rate drops to 1/32 of
the original design. But attacks like the one described above are then
impossible, because the device produces more entropy than it outputs state
information.

Concerning the PRNG elements, non-linear components could be used to
prevent attacks like the one described above. The disadvantage of only fixing
the PRNG parts of the RNG is that this provides only computational security.
Attacks are in principle still possible but require a large (hopefully too large
for practical application) computational effort. In contrast, TRNGs provide
information theoretical security.

Abstract. Representing finite field elements with respect to the
polynomial (or standard) basis, we consider a bit parallel multiplier
architecture for the finite field $GF(2^m)$. Time and space complexities
of such a multiplier heavily depend on the field defining irreducible 
polynomials. Based on a number of important classes of irreducible 
polynomials, we give exact complexity analyses of the multiplier gate 
count and time delay. In general, our results match or outperform 
the previously known best results in similar classes. We also present 
exact formulations for the coordinates of the multiplier output. Such 
formulations are expected to be useful to efficiently implement the 
multiplier using hardware description languages, such as VHDL and 
Verilog, without having much knowledge of finite field arithmetic. 

Keywords: Finite or Galois field, Mastrovito multiplier, pentanomial, 
polynomial basis, trinomial and equally-spaced polynomial. 



1 Introduction 

With the rapid expansion of the Internet and wireless communications, more and
more digital systems are being equipped with some form of cryptosystem to
provide various kinds of data security. Many such cryptosystems rely on
computations in very large finite fields and require fast computations in the
fields [5,1]. Among the basic arithmetic operations over the finite field
$GF(2^m)$, addition is easily realized using $m$ two-input XOR gates while
multiplication is costly in terms of gate count and time delay.

In the past, many bit parallel multipliers were proposed (see for example [3,
9,2,11,6,10]). In [4,3], Mastrovito proposed an algorithm along with its
hardware architecture for polynomial basis (PB) multiplication. In his scheme,
first a binary matrix is formed which is then multiplied with a binary vector to
obtain the required result. Halbutogullari and Koc have given a method for
constructing the Mastrovito multiplier for arbitrary irreducible polynomials
[2]. This method considers general as well as special classes of irreducible
polynomials such as trinomials, all-one polynomials (AOPs) and equally-spaced
polynomials (ESPs). So far, for these special polynomials, the XOR gate count
and time delay of the Halbutogullari-Koc algorithm appear to be the lowest. In
[11], Zhang and Parhi give a systematic method to design the Mastrovito
multiplier. Moreover, in [11], the method is extended to design the modified
Mastrovito multiplication scheme proposed in [8]. They also present new results
on the complexities of the Mastrovito multiplier for two classes of irreducible
pentanomials. Recently, Rodriguez-Henriquez and Koc in [7] have proposed a PB
multiplier for a special case of pentanomials and have given its time and gate
complexities.

In this article, first we review the multiplication scheme and its bit-parallel 
architecture presented in [6]. Then, using the reduction matrix Q, the complexi- 
ties of the multiplier based on a number of irreducible polynomials are obtained. 
We also present explicit formulations for the output coordinates of the multiplier 
in terms of its inputs. Such formulations can be directly coded using VHDL or 
Verilog languages to implement an efficient multiplier by someone who is not 
that familiar with finite field arithmetic. It is shown that for general irreducible 
polynomials, the space and time complexities of the proposed structure are lower 
than those available in the literature in terms of combined gate count and time 
delay. Furthermore, this architecture has fewer signals to be routed which is 
advantageous for VLSI implementation. 



2 Polynomial Basis Multiplications over $GF(2^m)$

Let $P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$ be a monic irreducible polynomial
over $GF(2)$ of degree $m$, where $p_i \in GF(2)$ for $i = 0, 1, \ldots, m-1$.
Let $\alpha \in GF(2^m)$ be a root of $P(x)$, i.e., $P(\alpha) = 0$. Then the
set $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ is referred to as the
polynomial or standard basis and each element of $GF(2^m)$ can be written with
respect to (w.r.t.) the polynomial basis (PB). Let $A$ be an element in
$GF(2^m)$, then the representation of $A$ w.r.t. the PB is
$A = \sum_{i=0}^{m-1} a_i \alpha^i$, $a_i \in \{0, 1\}$, where the $a_i$'s are
the coordinates. For convenience, these coordinates will be denoted in vector
notation¹ as $\mathbf{a} = [a_0, a_1, a_2, \ldots, a_{m-1}]^T$, where $T$
denotes the transposition. Using this vector notation, the representation of
$A$ can be written as $A = \boldsymbol{\alpha}^T \mathbf{a}$, where
$\boldsymbol{\alpha} = [1, \alpha, \alpha^2, \ldots, \alpha^{m-1}]^T$. Let $S$
be the binary polynomial of degree not more than $2m - 2$ obtained by the direct
multiplication of the PB representations of any two elements $A$ and $B$ of
$GF(2^m)$, i.e.,

$$S = \left(\sum_{i=0}^{m-1} a_i \alpha^i\right)\left(\sum_{i=0}^{m-1} b_i \alpha^i\right) = \sum_{k=0}^{m-1} d_k \alpha^k + \alpha^m \sum_{k=0}^{m-2} e_k \alpha^k, \qquad (1)$$

where

$$\mathbf{d} = [d_0, d_1, \ldots, d_{m-1}]^T = \mathbf{L}\mathbf{b}, \qquad (2)$$

$$\mathbf{e} = [e_0, e_1, \ldots, e_{m-2}]^T = \mathbf{U}\mathbf{b}. \qquad (3)$$

¹ In this paper, vectors and matrices are shown with small and capital bold
faces, respectively.

Then, the product $C = A \cdot B$ can be obtained by the following modulo
reduction.

$$C = \sum_{i=0}^{m-1} c_i \alpha^i = S \bmod P(\alpha). \qquad (5)$$

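Definition 1. [3] The reduction matrix $\mathbf{Q}$ is an $(m - 1)$ by $m$
binary matrix which is obtained from

$$\boldsymbol{\alpha}^{\uparrow} \equiv \mathbf{Q}\boldsymbol{\alpha} \pmod{P(\alpha)}, \qquad (6)$$

where $\boldsymbol{\alpha}^{\uparrow} = [\alpha^m, \alpha^{m+1}, \ldots, \alpha^{2m-2}]^T$.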

Theorem 1. [6] Let $C$ be the product of $A$ and $B \in GF(2^m)$. Then,

$$\mathbf{c} = [c_0, c_1, \ldots, c_{m-1}]^T = \mathbf{d} + \mathbf{Q}^T \mathbf{e}, \qquad (7)$$

where $\mathbf{d}$, $\mathbf{e}$ and $\mathbf{Q}$ are defined in (2), (3), and
(6) respectively.
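Theorem 1 can be illustrated with a small Python sketch. Field elements and
the polynomial $P(x)$ are stored as integers whose bit $i$ is the coefficient
of $\alpha^i$. The function names are hypothetical, and the field
$GF(2^8)$ with $P(x) = x^8 + x^4 + x^3 + x + 1$ is used only as a familiar test
case; it is not one of the polynomial classes studied in this paper. The
vectors d and e are taken as the low and high parts of the unreduced product
$S$, as in (1).

```python
def reduction_matrix(p, m):
    """Rows of Q as bitmasks: row i holds alpha^(m+i) mod P, for 0 <= i <= m-2."""
    rows = []
    r = p ^ (1 << m)              # alpha^m = sum of p_i alpha^i
    for _ in range(m - 1):
        rows.append(r)
        r <<= 1                   # multiply by alpha
        if (r >> m) & 1:
            r ^= p                # reduce modulo P
    return rows


def carryless_product(a, b):
    """Unreduced product S of two binary polynomials."""
    s = 0
    while b:
        if b & 1:
            s ^= a
        a <<= 1
        b >>= 1
    return s


def pb_multiply(a, b, p, m):
    """c = d + Q^T e over GF(2), following the structure of Theorem 1."""
    s = carryless_product(a, b)
    d = s & ((1 << m) - 1)        # coordinates d_0 .. d_(m-1)
    e = s >> m                    # coordinates e_0 .. e_(m-2)
    c = d
    for i, row in enumerate(reduction_matrix(p, m)):
        if (e >> i) & 1:
            c ^= row              # add e_i times row i of Q
    return c


def ref_multiply(a, b, p, m):
    """Reference: shift-and-add multiplication with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= p
    return r


AES_POLY = 0x11B                  # x^8 + x^4 + x^3 + x + 1 (example only)
product = pb_multiply(0x57, 0x83, AES_POLY, 8)
```

In hardware, each set bit in a row of Q corresponds to one XOR gate of the
Q-network, which is why the following sections count the non-zero entries of Q.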



The corresponding architecture for polynomial basis multiplication over
$GF(2^m)$ is shown in Figure 1. This structure is divided into two parts:
IP-network and Q-network. The IP-network has $m$ blocks (denoted as
$I_0, I_1, \ldots, I_{m-1}$) which generate vectors $\mathbf{d}$ and
$\mathbf{e}$ in accordance with (2) and (3), using $m^2$ AND gates and
$(m - 1)^2$ XOR gates. Using (2) and (3), the delay for $d_j$,
$0 \le j \le m - 1$, and $e_i$, $0 \le i \le m - 2$, can be calculated from

$$T(d_j) = T_A + \lceil \log_2 (j + 1) \rceil T_X, \quad 0 \le j \le m - 1, \qquad (8)$$

$$T(e_i) = T_A + \lceil \log_2 (m - i - 1) \rceil T_X, \quad 0 \le i \le m - 2. \qquad (9)$$
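The gate counts and delays quoted above can be checked by expanding the
product in (1). The check assumes that $T_A$ and $T_X$ denote the delays of one
two-input AND gate and one two-input XOR gate, and that the terms of each
coordinate are summed with a balanced XOR tree:

$$d_k = \sum_{i=0}^{k} a_i b_{k-i} \quad (k + 1 \text{ terms}), \qquad e_k = \sum_{i=k+1}^{m-1} a_i b_{m+k-i} \quad (m - 1 - k \text{ terms})$$

$$N_{AND} = \sum_{k=0}^{m-1} (k + 1) + \sum_{k=0}^{m-2} (m - 1 - k) = \frac{m(m+1)}{2} + \frac{m(m-1)}{2} = m^2$$

$$N_{XOR} = \sum_{k=0}^{m-1} k + \sum_{k=0}^{m-2} (m - 2 - k) = \frac{m(m-1)}{2} + \frac{(m-1)(m-2)}{2} = (m - 1)^2$$

A balanced XOR tree with $t$ inputs has depth $\lceil \log_2 t \rceil$, which
gives the logarithmic terms in (8) and (9).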



In Figure 1, the Q-network takes $\mathbf{d}$ and $\mathbf{e}$ as inputs and
generates $\mathbf{c}$. It is noted that the number of lines on the
interconnection bus IB is fixed and is equal to the number of $e_i$'s, i.e.,
$m - 1$. In Figure 1, there are three buses, A, B and IB, and the number of
lines on the buses is $3m - 1$.

In the following sections, we attempt to minimise the number of XOR gates of
the Q-network for special irreducible polynomials, namely equally-spaced
polynomials, trinomials, and pentanomials. We start with equally-spaced
polynomials, which are very structured and will help us present the remaining
special cases with less difficulty.

3 Multipliers Using Equally-Spaced Polynomials 

Definition 2. A polynomial $P(x) = x^{ns} + x^{(n-1)s} + \cdots + x^{s} + 1$ over $GF(2)$,
with $ns = m$ and $1 \le s \le \frac{m}{2}$, is called an equally-spaced polynomial (denoted
as s-ESP) of degree $m$.

Fig. 1. Architecture of the multiplier over $GF(2^m)$, where $CS^i$ represents an i-fold
cyclic shift.

When $s = 1$, we have the 1-ESP, which is the same as the all-one polynomial
(AOP) and has the highest Hamming weight among all polynomials of degree
$m$. On the other hand, $s = \frac{m}{2}$ results in the least Hamming weight irreducible
polynomial (i.e., trinomial) of degree $m$. It is easy to check that for an equally
spaced trinomial $m$ is even and $s = \frac{m}{2}$.

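Theorem 2. For an s-ESP based multiplier over $GF(2^m)$, the number of AND
gates ($N_A$), the number of XOR gates ($N_X$) and the time delay ($T_C$) are $N_A = m^2$,
$N_X = m^2 - s$, and $T_C = T_A + (1 + \lceil \log_2 m \rceil) T_X$, respectively.

Proof. When $\alpha$ is a root of the s-ESP of degree $m$ as defined above, we have

$\alpha^{m+i} = \begin{cases} \alpha^{m-s+i} + \cdots + \alpha^{s+i} + \alpha^{i}, & 0 \le i \le s-1, \\ \alpha^{i-s}, & s \le i \le m-2. \end{cases}$ (10)

Using (10), the reduction matrix $\mathbf{Q}$ is obtained as

$\mathbf{Q} = \left[\begin{array}{c} \begin{array}{cccc} \mathbf{I}_s & \mathbf{I}_s & \cdots & \mathbf{I}_s \end{array} \\ \begin{array}{cc} \mathbf{I}_{m-s-1} & \mathbf{0}_{s+1} \end{array} \end{array}\right]$, (11)

where $\mathbf{I}_j$ is the $j \times j$ identity matrix and $\mathbf{0}_{s+1}$ is a zero matrix which has $m - s - 1$
rows and $s + 1$ columns. The graphical representations of $\mathbf{Q}$ in (11) for different
values of $s$ are shown in Figure 2. In this figure, non-zero entries of $\mathbf{Q}$ are shown
with small squares.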

Fig. 2. Graphical representations of the locations of non-zero entries of $\mathbf{Q}$ for the s-ESP
$P(x) = x^m + x^{m-s} + \cdots + x^s + 1$, $m = ns$: (a) $s = 1$ (AOP), (b) $1 < s < \frac{m}{2}$, (c)
$s = \frac{m}{2}$ (trinomial).

In order to obtain exact expressions for $N_X$ and $T_C$, first we obtain the
coordinates of $C$. To this end, from Theorem 1 and (11), one can write

$c_j = e_{j \bmod s} + d'_j$, $0 \le j \le m-1$, (12)

where

$d'_j = \begin{cases} d_j + e_{j+s}, & 0 \le j \le m-s-2, \\ d_j, & m-s-1 \le j \le m-1. \end{cases}$ (13)

Thus, using (12) and (13), the exact XOR gate count for an s-ESP based
multiplier is $N_X = m^2 - s$. Also, by using (8) and (9), $d'_j$ of (13) can be generated
with a maximum gate delay of $T_A + (1 + \lceil \log_2 m \rceil) T_X$.
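The coordinate formulas for the s-ESP case can be checked in software. The sketch below is an illustration only and rests on one reading of (12) and (13), namely $c_j = e_{j \bmod s} + d'_j$ with $d'_j = d_j + e_{j+s}$ for $0 \le j \le m-s-2$ and $d'_j = d_j$ otherwise; the function names are assumptions made for this example. The result is compared against a direct reduction modulo the ESP.

```python
def clmul(a, b):
    """Carry-less product of two polynomials over GF(2) stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(u, p):
    """Remainder of u modulo p in GF(2)[x] (p must be non-zero)."""
    dp = p.bit_length() - 1
    while u.bit_length() - 1 >= dp:
        u ^= p << (u.bit_length() - 1 - dp)
    return u


def esp_multiply(a, b, m, s):
    """Multiply modulo the s-ESP of degree m using the coordinate formulas."""
    prod = clmul(a, b)
    d = [(prod >> j) & 1 for j in range(m)]            # d_0 .. d_{m-1}
    e = [(prod >> (m + j)) & 1 for j in range(m - 1)]  # e_0 .. e_{m-2}
    c = 0
    for j in range(m):
        # d'_j picks up e_{j+s} only while j + s <= m - 2
        d_prime = d[j] ^ (e[j + s] if j <= m - s - 2 else 0)
        c |= (e[j % s] ^ d_prime) << j
    return c
```

With $m = 6$ and $s = 3$ (the trinomial $x^6 + x^3 + 1$), `esp_multiply(0b100000, 0b100000, 6, 3)` evaluates $\alpha^{10}$, which reduces to $\alpha$ because $\alpha^9 = 1$ in this field.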

It is worth mentioning that the resultant number of signal lines on IB reduces
from $m - 1$ to $s$, which is considerably lower than the s-ESP based Mastrovito
multiplier which has +m signal lines [4]. Thus, the total number of lines 

on the buses of the multiplier is 2m + s. 

4 Extension to More Generic Polynomials 

Here we consider irreducible polynomials of the form $P(x) = x^m + x^{k_t} + \cdots + x^{k_2} + x^{k_1} + 1$,
where $1 \le k_1 < k_2 < \cdots < k_t \le \frac{m}{2}$. The Hamming weight of $P(x)$
is $t + 2$ and the degree of the second leading term is less than or equal to $\frac{m}{2}$.
All five binary fields recommended by NIST for ECDSA can be constructed by
such irreducible polynomials.

In order to apply the general formulation stated in Section 2 to these polynomials,
first we obtain the corresponding $\mathbf{Q}$ matrix. Note that all the rows of
the $\mathbf{Q}$ matrix are the PB representations of $\alpha^{m+i}$, $0 \le i \le m-2$, where $\alpha$ is a
root of $P(x)$. Since $P(\alpha) = 0$, then $\alpha^m = 1 + \alpha^{k_1} + \cdots + \alpha^{k_t}$. Thus, the
0-th row, i.e., $i = 0$, has ones only in these $t + 1$ columns of $\mathbf{Q}$: $0, k_1, k_2, \ldots, k_t$.
The consecutive rows of this matrix can be obtained by using a linear feedback
shift register (LFSR). As a result, the rows with $i = 0$ to $m - k_t - 1$ of $\mathbf{Q}$ have
$t + 1$ ones.
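The LFSR construction of the rows of $\mathbf{Q}$ is easy to sketch in software. The example below is illustrative; the function name `reduction_rows` and the dictionary `NIST_POLYS` are assumptions for this sketch. The dictionary lists the middle exponents of the reduction polynomials of the five NIST binary fields (degrees 163, 233, 283, 409 and 571), so that both the condition $k_t \le \frac{m}{2}$ and the row weight $t + 1$ can be checked.

```python
def reduction_rows(m, ks):
    """Rows of Q for P(x) = x^m + sum(x^k for k in ks) + 1.

    Row i is the PB representation of alpha^(m+i), 0 <= i <= m-2,
    stored as an integer whose bit j is the entry in column j.
    """
    low = 1                          # constant term
    for k in ks:
        low |= 1 << k
    rows = []
    r = low                          # alpha^m = 1 + alpha^k1 + ... + alpha^kt
    for _ in range(m - 1):
        rows.append(r)
        r <<= 1                      # LFSR step: multiply by alpha
        if (r >> m) & 1:
            r ^= (1 << m) | low      # feed back when alpha^m appears
    return rows


# Middle exponents (k_1, ..., k_t) of the NIST binary-field polynomials.
NIST_POLYS = {
    163: (3, 6, 7),
    233: (74,),
    283: (5, 7, 12),
    409: (87,),
    571: (2, 5, 10),
}
```

For instance, `reduction_rows(4, (1,))` returns the three rows of $\mathbf{Q}$ for $x^4 + x + 1$, and for each entry of `NIST_POLYS` the first $m - k_t$ rows each contain exactly $t + 1$ ones.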

Fig. 3. Graphical representations of the reduction matrix $\mathbf{Q}$ for trinomials: (a) $k = k_1 = 1$,
(b) $1 < k < \frac{m}{2}$ (see Figure 2(c) for $k_1 = \frac{m}{2}$); and for pentanomials: (c) $k_1 = 1$,
(d) $1 < k_1 < \frac{m}{2}$.

The $\mathbf{Q}$ matrices for $t = 1$ and $t = 3$ (i.e., trinomials and pentanomials, respectively)
are shown in Figure 3. As shown in this figure, row $i$, $0 \le i \le m - k_t - 1$,
of $\mathbf{Q}$ has $t + 1$ ones corresponding to the $t + 1$ segmented lines. When the last
column of $\mathbf{Q}$ contains a one, which takes place in row $i = m - k_j - 1$, $j = t, \ldots, 2, 1$,
the next row originates $t + 1$ new lines in columns $0, k_1, k_2$, up to $k_t$, provided
that there are no previous lines that pass these columns. If there exists a previous
line that passes the column $k_j$, $1 \le j \le t$, then the previous line terminates in
column $k_j - 1$ and no new line originates from column $k_j$ due to XORing of two
lines. This happens in row $\frac{m}{2}$ and column $\frac{m}{2}$ in Figure 2(c) for trinomials when
$k_1 = \frac{m}{2}$. This is also the case for pentanomials where $t = 3$ and it is shown in
Figures 3(c) and 3(d) for $k_1 = 1$ and $1 < k_1 < \frac{m}{2}$, respectively.

We divide the lines of $\mathbf{Q}$ into $t + 1$ sets (see Figure 4 for $t = 3$) such that
$\mathbf{Q} = \mathbf{Q}_0 + \mathbf{Q}_1 + \mathbf{Q}_2 + \cdots + \mathbf{Q}_t$, where the non-zero entries of $\mathbf{Q}_i$, $0 \le i \le t$, start from
the column $k_i$ (assume that $k_0 = 0$). It is noted that the last non-zero entry of
sub-matrix $\mathbf{Q}_i$, $1 \le i \le t$, is in column $m - 1$, whereas the one in $\mathbf{Q}_0$ is in column
$m - 2$. Moreover, the number of ones in each column of $\mathbf{Q}_i$, $0 \le i \le t$, is at most
$t + 1$ if $k_1 > 1$, and $t$ if $k_1 = 1$.




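Fig. 4. Graphical representations of submatrices of $\mathbf{Q} = \mathbf{Q}_0 + \mathbf{Q}_1 + \mathbf{Q}_2 + \mathbf{Q}_3$ for
pentanomials $P(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $1 < k_1 < k_2 < k_3 \le \frac{m}{2}$ (see
Figure 3(d) for $\mathbf{Q}$). (a) $\mathbf{Q}_0$, (b) $\mathbf{Q}_1$, (c) $\mathbf{Q}_2$, (d) $\mathbf{Q}_3$.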



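Theorem 3. The number of XOR gates and the time delay of the multiplier
based on the irreducible polynomial $P(x) = x^m + x^{k_t} + \cdots + x^{k_2} + x^{k_1} + 1$,
$1 \le k_1 < k_2 < \cdots < k_t \le \frac{m}{2}$, are

$N_X = (m + t)(m - 1)$

and

$T_C = T_A + \left(\lceil \log_2(t+1) \rceil + \lceil \log_2(\lceil \tfrac{t}{2} \rceil + 1) \rceil + \lceil \log_2(m-1) \rceil\right) T_X$.

Proof. Let us denote $\mathbf{e}^{(i)} = [e_0^{(i)}, \ldots, e_{m-1}^{(i)}]^T = \mathbf{Q}_i^T \mathbf{e}$, $0 \le i \le t$; then, using
Theorem 1, we can obtain the coordinates of the pentanomial based multiplication as

$\mathbf{c} = \mathbf{d} + \mathbf{e}^{(0)} + \mathbf{e}^{(1)} + \cdots + \mathbf{e}^{(t)}$. (14)

First, let us assume $k_1 \ne 1$. Using $\mathbf{Q}_0$ (see Figure 4(a) for $t = 3$), the elements
of $\mathbf{e}^{(0)}$ are as follows:

$e_j^{(0)} = \begin{cases} e_j + e_{j+m-k_t} + \cdots + e_{j+m-k_2} + e_{j+m-k_1}, & \text{if } 0 \le j \le k_1 - 2 \\ e_j + e_{j+m-k_t} + \cdots + e_{j+m-k_2}, & \text{if } k_1 - 1 \le j \le k_2 - 2 \\ \quad \vdots \\ e_j + e_{j+m-k_t}, & \text{if } k_{t-1} - 1 \le j \le k_t - 2 \\ e_j, & \text{if } k_t - 1 \le j \le m - 2 \\ 0, & \text{if } j = m - 1. \end{cases}$ (15)

The total number of XOR gates to form $e_j^{(0)}$, $0 \le j \le k_t - 2$, is
$N_1 = t(k_1 - 1) + (t-1)(k_2 - k_1) + \cdots + (k_t - k_{t-1}) = \sum_{i=1}^{t} k_i - t$. Let $T(e_j^{(0)})$ denote the
time delay due to gates to find $e_j^{(0)}$. As seen in (15), the longest path delay is to
obtain $e_0^{(0)} = e_0 + e_{m-k_t} + \cdots + e_{m-k_2} + e_{m-k_1}$, i.e., $T(e_j^{(0)}) \le T(e_0^{(0)})$. In order
to reduce this delay, we first add any two terms except $e_0$, e.g., $e_{m-k_j} + e_{m-k_i}$,
$1 \le i, j \le t$, $i \ne j$. Then we add these $\lceil \frac{t}{2} \rceil$ signals to $e_0$ using a binary tree of
XOR gates. Since $T(e_j) = T_A + \lceil \log_2(m-j-1) \rceil T_X$, then
$T(e_{m-k_j} + e_{m-k_i}) \le T_X + T(e_{m-k_t}) = T_A + (1 + \lceil \log_2(k_t - 1) \rceil) T_X \le T_A + \lceil \log_2(m-1) \rceil T_X$,
where the last inequality is due to $k_t \le \frac{m}{2}$. Thus, we have

$T(e_j^{(0)}) \le \begin{cases} T_A + \left(\lceil \log_2(\lceil \tfrac{t}{2} \rceil + 1) \rceil + \lceil \log_2(m-1) \rceil\right) T_X, & \text{if } 0 \le j \le k_t - 2 \\ T_A + \lceil \log_2(m-1) \rceil T_X, & \text{if } k_t - 1 \le j \le m - 2. \end{cases}$ (16)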

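By reusing the signals $e_j^{(0)}$, the coordinates of $\mathbf{e}^{(i)}$ for $1 \le i \le t$ can be
obtained as

$e_j^{(i)} = \begin{cases} 0, & \text{if } 0 \le j \le k_i - 1 \\ e_{j-k_i}^{(0)}, & \text{otherwise.} \end{cases}$ (17)

This results in the coordinates of $C = AB$ as

$c_j = \begin{cases} d_j + e_j^{(0)}, & \text{if } 0 \le j \le k_1 - 1 \\ d_j + e_j^{(0)} + e_j^{(1)}, & \text{if } k_1 \le j \le k_2 - 1 \\ \quad \vdots \\ d_j + e_j^{(0)} + \cdots + e_j^{(t-1)}, & \text{if } k_{t-1} \le j \le k_t - 1 \\ d_j + e_j^{(0)} + e_j^{(1)} + \cdots + e_j^{(t)}, & \text{if } k_t \le j \le m - 2 \\ d_j + e_j^{(1)} + \cdots + e_j^{(t)}, & \text{if } j = m - 1 \end{cases}$ (18)

by using (14). To realize (18) in hardware, one requires
$N_2 = m + (k_2 - k_1) + 2(k_3 - k_2) + \cdots + (t-1)(k_t - k_{t-1}) + t(m - k_t - 1) + t - 1 = (t+1)m - \sum_{i=1}^{t} k_i - 1$ XOR
gates. Thus, the total number of XOR gates needed for the multiplier is $(m-1)^2 + N_1 + N_2 = (m + t)(m - 1)$.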
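The decomposition used in this proof can be exercised with a small software model. The sketch below follows one reading of (15), (17) and (18): $e_j^{(0)}$ collects $e_j$ and every $e_{j+m-k_l}$ with $j \le k_l - 2$, the vectors $\mathbf{e}^{(i)}$ are copies of $\mathbf{e}^{(0)}$ shifted by $k_i$ positions, and $c_j$ is the sum of $d_j$ and all applicable terms. The function names are assumptions for this example, and the model assumes $k_t \le \frac{m}{2}$.

```python
def clmul(a, b):
    """Carry-less product of two polynomials over GF(2) stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(u, p):
    """Remainder of u modulo p in GF(2)[x] (p must be non-zero)."""
    dp = p.bit_length() - 1
    while u.bit_length() - 1 >= dp:
        u ^= p << (u.bit_length() - 1 - dp)
    return u


def multiply_via_submatrices(a, b, m, ks):
    """Multiply modulo x^m + sum(x^k for k in ks) + 1, with max(ks) <= m/2."""
    prod = clmul(a, b)
    d = [(prod >> j) & 1 for j in range(m)]            # d_0 .. d_{m-1}
    e = [(prod >> (m + j)) & 1 for j in range(m - 1)]  # e_0 .. e_{m-2}

    # e0[j] models e_j^(0); e0[m-1] stays 0.
    e0 = [0] * m
    for j in range(m - 1):
        e0[j] = e[j]
        for k in ks:
            if j <= k - 2:
                e0[j] ^= e[j + m - k]

    # c_j = d_j + e_j^(0) + sum over i of e_{j-k_i}^(0) for j >= k_i.
    c = 0
    for j in range(m):
        cj = d[j] ^ e0[j]
        for k in ks:
            if j >= k:
                cj ^= e0[j - k]
        c |= cj << j
    return c
```

As a usage example, `multiply_via_submatrices(0x57, 0x83, 8, (1, 3, 4))` multiplies in the field defined by $x^8 + x^4 + x^3 + x + 1$ and yields `0xC1`.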

To obtain the time delay of the proposed multiplier, we use a binary tree
for each coordinate in (18). For $j \notin [k_t, m-2]$, it is seen in (18) that
$T_{c_j} \le \lceil \log_2(t+1) \rceil T_X + T(e_0^{(0)})$ and the proof is complete by using (16). Now, we need
only to obtain the time delay of the $c_j$'s for $k_t \le j \le m - 2$. For $j \in [k_t, m-2]$, if
we form $c_j = (d_j + e_j^{(0)}) + e_j^{(1)} + \cdots + e_j^{(t)}$ such that $d_j + e_j^{(0)}$ is calculated
first, then

$T(d_j + e_j^{(0)}) \le T_A + (1 + \lceil \log_2(m-1) \rceil) T_X \le T_A + \left(\lceil \log_2(\lceil \tfrac{t}{2} \rceil + 1) \rceil + \lceil \log_2(m-1) \rceil\right) T_X$.

Also, using (17) and (16), one can see

$T(e_j^{(i)}) \le T_A + \left(\lceil \log_2(\lceil \tfrac{t}{2} \rceil + 1) \rceil + \lceil \log_2(m-1) \rceil\right) T_X$,

which implies that

$T_C \le T_A + \left(\lceil \log_2(t+1) \rceil + \lceil \log_2(\lceil \tfrac{t}{2} \rceil + 1) \rceil + \lceil \log_2(m-1) \rceil\right) T_X$

and the proof is complete.

In addition to the three buses shown in Figure 1, there will now be another
bus in the middle of the Q-network for the signals $e_j^{(0)}$, $0 \le j \le k_t - 2$. Thus, the
total number of lines on the buses is $3m + k_t - 2$.

Corollary 1. For $k_1 = 1$ and $t > 1$, the time delay would reduce to

$T_A + \left(\lceil \log_2(t+1) \rceil + \lceil \log_2 \lceil \tfrac{t}{2} \rceil \rceil + \lceil \log_2(m-1) \rceil\right) T_X$.

Based on the above results, one can obtain the time delay and the number of
XOR gates for the trinomial based multiplier by substituting $t = 1$ in Theorem
3 for $k_1 \ne \frac{m}{2}$, and $s = \frac{m}{2}$ in Theorem 2 for $k_1 = \frac{m}{2}$. Note that the results for
$k_1 = \frac{m}{2}$ are obtained using the implementation of the $\frac{m}{2}$-ESP based multiplier.
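The closed-form results of this section are simple to evaluate numerically. The helper below is a sketch under the reading that Theorem 3 gives $N_X = (m+t)(m-1)$ and a delay of $T_A$ plus $\lceil \log_2(t+1) \rceil + \lceil \log_2(\lceil t/2 \rceil + 1) \rceil + \lceil \log_2(m-1) \rceil$ XOR levels, and that Corollary 1 replaces $\lceil t/2 \rceil + 1$ by $\lceil t/2 \rceil$ when $k_1 = 1$ and $t > 1$. The names `clog2` and `theorem3_complexity` are assumptions for this example.

```python
def clog2(n):
    """Exact ceil(log2(n)) for an integer n >= 1."""
    return (n - 1).bit_length()


def theorem3_complexity(m, t, k1_is_one=False):
    """Return (N_A, N_X, number of XOR levels added to T_A)."""
    n_and = m * m
    n_xor = (m + t) * (m - 1)
    half = (t + 1) // 2                       # ceil(t / 2)
    # Corollary 1 only applies for k_1 = 1 and t > 1.
    inner = half if (k1_is_one and t > 1) else half + 1
    xor_levels = clog2(t + 1) + clog2(inner) + clog2(m - 1)
    return n_and, n_xor, xor_levels
```

For a degree-163 pentanomial ($t = 3$, $k_1 > 1$) this gives 26569 AND gates, 26892 XOR gates and 12 XOR levels; with $k_1 = 1$ the number of XOR levels drops to 11.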

5 Special Classes of Pentanomials 

A polynomial with five non-zero coefficients, i.e., $P(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$,
where $1 \le k_1 < k_2 < k_3 \le m - 1$, is called a pentanomial of degree $m$. The
non-zero constant term is due to the irreducibility property needed to define
the field. In terms of the values of the $k_i$'s, the pentanomials can be divided into a
number of different classes. Below we consider two special classes of irreducible
pentanomials as proposed in [11].

5.1 Class 1: $k_3 \le \frac{m}{2}$

For this class of irreducible pentanomials where $k_3 \le \frac{m}{2}$, one can apply $t = 3$ to
the complexity results we have presented in Section 4. This yields the following.

Corollary 2. The gate counts and time delay of the multiplier for the pentanomial
$P(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $1 \le k_1 < k_2 < k_3 \le \frac{m}{2}$,
are

$N_A = m^2$,

$N_X = m^2 + 2m - 3$,

$T_C \le \begin{cases} T_A + (3 + \lceil \log_2(m-1) \rceil) T_X, & \text{if } k_1 = 1 \\ T_A + (4 + \lceil \log_2(m-1) \rceil) T_X, & \text{otherwise,} \end{cases}$

and the number of lines on the buses is $3m + k_3 - 2$.

The number of XOR gates can be reduced if we choose a pentanomial such
that $k_1 = k_3 - k_2$. Towards this, let us introduce the following set of new signals:

$e'_j = e_{j+m-k_3} + e_{j+m-k_2}$, $0 \le j \le k_2 - 2$. (19)

Equation (19) can be used to generate $e_j^{(0)}$, $0 \le j \le k_2 - 2$, by substituting $t = 3$
in (15) as follows:

$e_j^{(0)} = \begin{cases} e_j + e'_j + e_{j+m-k_1}, & \text{if } 0 \le j \le k_1 - 2 \\ e_j + e'_j, & \text{if } k_1 - 1 \le j \le k_2 - 2 \\ e_j + e_{j+m-k_3}, & \text{if } k_2 - 1 \le j \le k_3 - 2 \\ e_j, & \text{if } k_3 - 1 \le j \le m - 2 \\ 0, & \text{if } j = m - 1. \end{cases}$ (20)

The total number of XOR gates needed to generate the $e_j^{(0)}$'s (see (20)) is
$N_1 = k_1 + k_2 + k_3 - 3$, of which $k_2 - 1$ are due to (19). Also, the maximum delay
due to gates in (20) is

$T(e_j^{(0)}) \le \begin{cases} T_A + (2 + \lceil \log_2(m-1) \rceil) T_X, & \text{if } 0 \le j \le k_1 - 2 \\ T_A + (1 + \lceil \log_2(m-1) \rceil) T_X, & \text{if } k_1 - 1 \le j \le k_3 - 2 \\ T_A + \lceil \log_2(m-1) \rceil T_X, & \text{if } k_3 - 1 \le j \le m - 1. \end{cases}$ (21)

Lemma 1. With symbols defined as above, one has

$e_j^{(0)} + e_j^{(1)} = e'_{j+k_2-m}$, for $m - k_2 \le j \le m - 2$,

$e_j^{(2)} + e_j^{(3)} = e_{j-k_2}^{(01)}$, for $k_2 \le j \le m - 1$.

Let us represent $e_j^{(01)}$, $0 \le j \le m - 1$, as the elements of $(\mathbf{Q}_0 + \mathbf{Q}_1)^T \mathbf{e}$,
where $\mathbf{Q}_0$ and $\mathbf{Q}_1$ are shown in Figure 4(a) and Figure 4(b), respectively. Then,
substituting $t = 3$ in the general case given in (18) and using the above lemma,
we can obtain the coordinates of $C = AB$ as follows:

$c_j = d_j + e_j^{(01)} + e_{j-k_2}^{(01)}$, $0 \le j \le m - 1$, (22)

where $e_{j-k_2}^{(01)} = 0$ for $j < k_2$, and

$e_j^{(01)} = \begin{cases} e_j^{(0)}, & \text{if } 0 \le j \le k_1 - 1 \\ e_j^{(0)} + e_j^{(1)}, & \text{if } k_1 \le j \le m - k_2 - 1 \\ e'_{j+k_2-m}, & \text{if } m - k_2 \le j \le m - 2 \\ e_j^{(1)}, & \text{if } j = m - 1. \end{cases}$ (23)

As seen in (23), one has to realize $e_j^{(0)} + e_j^{(1)}$ for all $k_1 \le j \le m - k_2 - 1$, which
requires $m - k_2 - k_1$ XOR gates. Once the $e_j^{(01)}$ are obtained, equation (22)
requires $2m - k_2$ XOR gates. Thus, the total number of XOR gates needed for the
multiplier is $(m-1)^2 + N_1 + m - k_2 - k_1 + 2m - k_2 = m^2 + m + k_1 - 2$. Due to the reuse
of the terms $e'_j$, $0 \le j < k_2 - 1$, and $e_j^{(0)} + e_j^{(1)}$, $k_1 \le j \le m - k_2 - 1$, the additional lines
needed on the bus in the Q-network are $(k_2 - 1)$ and $(m - k_1 - k_2)$, respectively.
Thus, the total number of lines on the buses is increased to $4m + k_2 - 3$.

To obtain the time delay of the proposed multiplier, we use Table 1, which
shows the maximum delay of the signals used in (22) for the given ranges of
$j$ in each row. In this table, $i$, $0 \le i \le 4$, represents the time delay of
$T_A + (i + \lceil \log_2(m-1) \rceil) T_X$, and the numbers inside brackets are for $k_1 = 1$. Also,
x marks which of $e_j^{(01)}$ or $e_{j-k_2}^{(01)}$ is added to $d_j$ first to obtain $c_j$. In
each row of this table, the delays are obtained for the first value of the given
range. This is because, as $j$ increases, the time delays of the signals used in each
row of this table decrease. As seen in this table, the maximum delay of the
multiplier is $T_A + (4 + \lceil \log_2(m-1) \rceil) T_X$. For $k_1 = 1$, only one signal, i.e., $c_{k_3}$,
has the delay of $T_A + (4 + \lceil \log_2(m-1) \rceil) T_X$. One can reduce this delay to
$T_A + (3 + \lceil \log_2(m-1) \rceil) T_X$ if only $c_{k_3}$ is realized as $c_{k_3} = ((d_{k_3} + e_{k_3}^{(0)}) + e_{k_3}^{(1)}) + e_{k_3-k_2}^{(01)}$
by using one extra XOR gate.



Table 1. Maximum time delays of the signals, where $i$, $0 \le i \le 4$, represents the time
delay of $T_A + (i + \lceil \log_2(m-1) \rceil) T_X$, numbers inside brackets are for $k_1 = 1$, and x
marks which of $e_j^{(01)}$ or $e_{j-k_2}^{(01)}$ is added first with $d_j$.

j                               e_j^(0)   e_j^(1)   e_j^(01)   e_{j-k_2}^(01)   d_j + x   c_j
0 ≤ j ≤ k_1 − 1                 2(1)      -         2(1), x    -                3         3
k_1 ≤ j ≤ k_2 − 1               1         2(1)      3(2), x    -                4(3)      4(3)
k_2 ≤ j ≤ k_3 − 1               1         2(1)      3(2)       2(1), x          3(2)      4(3)
k_3 ≤ j ≤ k_3 + k_1 − 1         0         1         2, x       3(2)             3         4
k_3 + k_1 ≤ j ≤ m − k_2 − 1     0         0         1, x       3(2)             2         4(3)
m − k_2 ≤ j ≤ m − 2             0         0         1, x       3(2)             2         4(3)
j = m − 1                       -         0         1, x       1                2         3

Based on the above results, we can state the following.

Theorem 4. The gate counts and time delay of the multiplier based on the
pentanomial $P(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $1 \le k_1 < k_2 < k_3 \le \frac{m}{2}$
and $k_3 - k_2 = k_1$, are

$N_A = m^2$,

$N_X = \begin{cases} m^2 + m, & \text{if } k_1 = 1 \\ m^2 + m + k_1 - 2, & \text{otherwise,} \end{cases}$

$T_C \le \begin{cases} T_A + (3 + \lceil \log_2(m-1) \rceil) T_X, & \text{if } k_1 = 1 \\ T_A + (4 + \lceil \log_2(m-1) \rceil) T_X, & \text{otherwise,} \end{cases}$

and the number of lines on the buses is $4m + k_2 - 3$.

Remark 1. To verify that class 1 irreducible pentanomials exist, we have used a
Maple™ program for $m \in [160, 600]$ and have found that at least one irreducible
pentanomial exists for each $m$ in the range of 160 to 600. This is of interest to
elliptic curve cryptosystem designers. In order to minimise the number of XOR
gates of the multiplier, we have obtained irreducible pentanomials such that $k_1$
is minimum. We have also observed that $k_1$ is less than or equal to six for all $m$ in
the above mentioned range.

It is noted that the pentanomial presented in [7] is a special case when $k_1 = 1$.



5.2 Class 2: $m - k_3 = k_3 - k_2 = k_2 - k_1 = s$, $\frac{m-1}{8} \le s \le \frac{m-1}{3}$

We refer to polynomials $P(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $1 \le k_1 < k_2 < k_3 \le m - 1$
and $m - k_3 = k_3 - k_2 = k_2 - k_1 = s$, as class 2 type. Similar to
the other special irreducible polynomials, here we first obtain the corresponding
reduction matrix. Then the coordinates and complexities of the multiplier can be
obtained. Based on the values of $s$ (or $k_1 = m - 3s$), we can divide the reduction
matrix into different forms. Because of lack of space, only three of them are
presented here. These $\mathbf{Q}$ matrices for $\frac{m-1}{8} \le s \le \frac{m-1}{3}$ (or $1 \le k_1 \le 5s + 1$) are
shown in Figure 5. Based on this figure, we can state the following theorem.









Fig. 5. Graphical representations of the reduction matrix $\mathbf{Q}$ for class 2 pentanomials
$P(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, where $m - k_3 = k_3 - k_2 = k_2 - k_1 = s$. (a)
$\frac{m-1}{4} \le s \le \frac{m-1}{3}$ or $1 \le k_1 \le s + 1$ (see Figure 2(a) for $k_1 = s$), (b) $\frac{m-1}{5} \le s < \frac{m-1}{4}$
or $s + 1 < k_1 \le 2s + 1$, (c) $\frac{m-1}{8} \le s < \frac{m-1}{5}$ or $2s + 1 < k_1 \le 5s + 1$.

Theorem 5. The gate counts and the time delay of the multiplier for the pentanomial
$P(x) = x^m + x^{m-s} + x^{m-2s} + x^{m-3s} + 1$, for $\frac{m-1}{8} \le s \le \frac{m-1}{3}$, are

Na = mf, 

Nx=\m'^ + 2m-^s-2 if^ <s<^ 

[m'^ + m-2 if"^<s<^ 

^(TA + {S+\log^{m-mTx,if^<s<^ 

^ Ta + (4 + |"log 2 (m — 1)]) Tx^ otherwise. 

Remark 2. Using Maple"’’^, we have found that there exists 147 values of m, 
where m € [160,600] such that polynomial P(x) = x’^ + x”^~^+x’^~^^ + x™'~^^ + 
1, 1 < s < is irreducible. Among them only 23 have 1 < s < 



Table 2. Comparison of related polynomial basis multipliers. 



Reference | Special Case | ^XOR Time delay 


P{x) = x"" + + ■ ■ ■ + X* + 1, m = ns 


This paper, [2,11] 1 - j m‘‘ — s Ta + (1 + [log 2 m] ) Tx 


P(x) = X™ + x'^ + 1 


[9,2,11] 

[9,2,11] 


k = l 
Kk<f 


— 1 

rn? — 1 


Tpi + (1 + \l 0 g 2 m~])Tx 
Ta + (2 + [log 2 m~\ ) Tx 


This paper, ]10] 


l<k< ^ 


— 1 


Ta + ( 2 + ]log 2 (m- 1)1) Tx 


1 P(x) = X™ + x'“ + ■ ■ ■ + x'^^ + x'^l + 1, 1 < fcl < fe 2 < ■ ■ ■ < fct < ^ 1 


[11] 

This paper 


t > 1 
t > 1 


(m + t){ra— 1) 
(m + t)(m — 1) 


Ta + (2t + ]log 2 m] ) Tx 

Ta + ([log2([|] + 1)1 + 

]log 2 (< + 1)1 + ]log 2 (m - l)l)Tx 


1 P(x) = X™ + x'“3 + x'“2 + x*"! + 1, 1< fci < fc 2 < fca < ? 1 


[11] 


fci > 1 


+ 2m — 3 


Ta + (6 + ]log 2 m] ) Tx 


This paper 
This paper 


fcl > 1 
ki = 1 


+ 2m — 3 
+ 2m — 3 


Ta + (4+ ]log 2 (m - 1)1) Tx 
Ta + (3+ ]log 2 (m- 1)1) Tx 


This paper 


ks - k2 = ki 


rrP + m + fci — 2 


Ta + (4+ ]log 2 (m - 1)1) Tx 


[7] 

This paper 


1 1 

II II 

??■ ?r 

II II 


m^ + m + 2 /c 2 
m? + m 


Ta + (3+ ]log 2 (m- 1)1) Tx 
Ta + (3+ ]log 2 (m- 1)1) Tx 


This paper, [7] 


ki = i 


m^ + m 


Ta + (3+ ]log 2 (m- 1)1) Tx 


1 P(x) = X™ + x”*-" + x"*-"'’ + + 1 1 


[11] 

[11] 


1< <j < 

s< =1^ 


m^ + 4m — 5s — 5 
> m? + 2.33m — 7 


Ta + (L|J + 4+]log2(m-l)l)Tx 
> Ta + (4+ ]log 2 (m- 1)1) Tx 


This paper 


^ < s < 

2 Z= s2 


< m^ + m 


< Ta + (4 + ]log 2 (ni - 1)1) Tx 



6 Complexity Results and Concluding Remarks 

In this article, time and space complexities of bit parallel multipliers for $GF(2^m)$
have been considered. A comparison of our newly derived gate counts and delays

Table 3. Comparison of the structure of Figure 1 with the Mastrovito multiplier in
terms of the number of lines on the buses.



Multipliers 


fk Lines on the buses 


trinomial 


s-ESP 


pentanomial 


generic 


Mastrovito [4] 


3m — 1 


+ 2 m 


5m — 3 


{t + 2){m-l) + 2 


This paper 


3m — 1 


2 m - 1 - s 


< 4m - 1 - k 2 


3m + kt — 2 



with those of existing ones is shown in Table 2. As seen in this table, for the trinomial
$x^m + x + 1$, the multiplier of Figure 1 has one additional XOR gate delay compared
to the best one available in the literature, i.e., [2,11]. However, our results for the
ESPs and trinomials ($k \ne 1$) match the corresponding best results available ([2,
11] and [9]). Also, the resultant gate and time complexities for trinomials match
those presented in [10].

For a more generic irreducible polynomial as discussed in Section 4, the
multiplier in Figure 1 has the same gate count but a shorter time delay compared
to [11]. For class 1 pentanomials, this multiplier is faster than [11] and has fewer
XOR gates if the special case of $k_3 - k_2 = k_1$ is used. This proposed special
case of class 1 covers the case of pentanomials reported in [7], where $k_1 = 1$.
Compared to the multiplier proposed in [7], the multiplier discussed in this paper
for the special case of $k_1 = k_3 - k_2 = 1$ has $2k_2$ fewer XOR gates and matches the
one proposed in [7] for $k_1 = 1$ and $k_2 = 2$. Also, for class 2 pentanomials, our
multiplier is either faster or has the same gate delay and has at least $1.33m - 7$
fewer XOR gates than the multiplier reported in [11].

In VLSI implementation, in addition to the gate counts, the number of lines
on the buses is also an important parameter which determines the space complexity
and consequently the actual time delay. Table 3 compares this metric of
the proposed architecture with that of the Mastrovito multiplier [4]. As shown in
this table, the architectures discussed here have fewer lines on the
buses than the well known Mastrovito multiplier.

Abstract. This paper describes a new efficient method of modular reduction
in $\mathbb{F}_q[x]$ suited for both software and hardware implementations.
This method is particularly well adapted to smart card implementations
of elliptic curve cryptography over $GF(2^p)$ using a polynomial representation.
Many publications use the equivalent in $\mathbb{F}_2[x]$ of Montgomery’s
modular multiplication over integers. We show here an equivalent in
$\mathbb{F}_q[x]$ to the generalized Barrett’s modular reduction over integers. The
attractive properties of the latter method in $\mathbb{F}_2[x]$ allow nearly ideal implementations
in hardware as well as in software with minimum additional
resources as compared to what is available on usual processor architectures.

An implementation minimizing the memory accesses is described for both
Montgomery’s implementation and ours. This shows identical computing
and memory access resources for both methods. The new method also
avoids the need for the bulky normalization (denormalization) which is
required by Montgomery’s method to obtain a correct result.

Keywords: Smart card, cryptography, modular multiplication, quotient
evaluation, elliptic curves, ECDSA, Montgomery, Barrett, multiply and
add without carries, multiplications in $\mathbb{F}_2[x]$.



1 Introduction 

Montgomery’s multiplication in $\mathbb{F}_2[x]$ is a well known “right to left” modular
reduction (see for example [1]) and derives directly from Montgomery’s modular
reduction over integers [2,3]. It is often used for efficient software implementations
of elliptic curve crypto-systems like ECDSA [4], using polynomial
representations. The main disadvantage of Montgomery’s method is the bulky
normalization and denormalization phase required to obtain a correct modular
reduction. In the set of integers, the corresponding “left to right” methods [5,6,7]
are less efficient in software because they require additional resources that are
non-standard on general purpose microprocessors (a special multiplier) to obtain similar
performance. We will show that the generalized Barrett’s method can be made
more efficient in $\mathbb{F}_2[x]$ so that it can compete directly with Montgomery’s
method. Moreover, because of the absence of normalization, the memory requirements
for data and ROM code of our method are smaller. This could be of
major importance in constrained implementations (e.g. smart cards).

The easiest “left to right” method to compute a quotient and a remainder
is the one taught at school: the quotient is calculated by first zeroing the upper
(most significant) part of the numerator by adding or subtracting a multiple of
the denominator. Montgomery’s method, however, zeroes the least significant
bits of the number to reduce.

In section 2, we describe how the quotient is computed in order to perform an
improved “left to right” modular reduction in $\mathbb{F}_q[x]$.

In section 3, the particular case of the modular multiplication in $\mathbb{F}_2[x]$ is
studied.



2 Quotient Evaluation in $\mathbb{F}_q[x]$

In order to compute $S(x) = U(x) \bmod N(x)$ we could first evaluate, in a scholastic
way, the quotient $Q(x)$ defined by the equation $U(x) = Q(x)N(x) + S(x)$,
where $S(x)$ and $Q(x)$ are respectively the remainder and the quotient of the Euclidean
division [8] of $U(x)$ by $N(x)$. By similarity to computations over integers
we define $Q(x) = \lfloor U(x)/N(x) \rfloor$. The degree $p$ of the polynomial $N(x)$ is noted $\deg(N)$.
We also define $\alpha = \deg(U) - \deg(N)$.

In most applications, like in elliptic curve crypto-systems, $N(x)$ is fixed (when
working on a given elliptic curve). To speed up the computations, we can precompute
$\lfloor x^{p+\beta}/N(x) \rfloor$ ($= R(x)$), for some value of $\beta$ defined hereafter. The quotient’s
evaluation may then be reduced to a multiplication and an appropriate shift
(division by a power of $x$) as shown in equation 1 (see [5] and [7] when working
over integers).

$\hat{Q}(x) = \left\lfloor \frac{\left\lfloor \frac{U(x)}{x^{p}} \right\rfloor \left\lfloor \frac{x^{p+\beta}}{N(x)} \right\rfloor}{x^{\beta}} \right\rfloor = \left\lfloor \frac{T(x)R(x)}{x^{\beta}} \right\rfloor$ (1)

In equation 1, $\lfloor A(x)/B(x) \rfloor$ represents the quotient
of the polynomial division of $A(x)$ by $B(x)$, discarding the remainder ($A(x)$
and $B(x)$ are some polynomials). In the next sections we will demonstrate the
equivalence of equation 1 with the real quotient $Q(x) = \lfloor U(x)/N(x) \rfloor$.



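2.1 Equivalence between $\hat{Q}(x)$ and $Q(x)$

In the present section, $A_k(x)$ represents the polynomial $A(x)$ of degree $k$ (where
$A(x)$ is some polynomial).

We can write:

$\frac{U(x)}{x^{p}} = \phi_\alpha(x) + \frac{\psi_{p-1}(x)}{x^{p}}$

where $\phi_\alpha(x)$ is the quotient and $\psi_{p-1}(x)$ the remainder of the division. Similarly,
we can write:

$\frac{x^{p+\beta}}{N(x)} = \Delta_\beta(x) + \frac{\delta_{p-1}(x)}{N_p(x)}$

where $\Delta_\beta(x)$ is the quotient and $\delta_{p-1}(x)$ the remainder of the division. We now
can write:

$Q(x) = \left\lfloor \frac{\left(\frac{U(x)}{x^{p}}\right)\left(\frac{x^{p+\beta}}{N(x)}\right)}{x^{\beta}} \right\rfloor$ (2)

$= \left\lfloor \frac{\left(\phi_\alpha(x) + \frac{\psi_{p-1}(x)}{x^{p}}\right)\left(\Delta_\beta(x) + \frac{\delta_{p-1}(x)}{N_p(x)}\right)}{x^{\beta}} \right\rfloor$ (3)

$= \left\lfloor \frac{\phi_\alpha(x)\Delta_\beta(x)}{x^{\beta}} + \frac{1}{x^{\beta}}\left(\frac{\phi_\alpha(x)\delta_{p-1}(x)}{N_p(x)} + \frac{\psi_{p-1}(x)\Delta_\beta(x)}{x^{p}} + \frac{\psi_{p-1}(x)\delta_{p-1}(x)}{x^{p}N_p(x)}\right) \right\rfloor$ (4)

The second term of equation 4 is nullified only if $\beta \ge \alpha$. In that case only we
can write:

$Q(x) = \left\lfloor \frac{\phi_\alpha(x)\Delta_\beta(x)}{x^{\beta}} \right\rfloor$ (5)

$= \left\lfloor \frac{\left\lfloor \phi_\alpha(x) + \frac{\psi_{p-1}(x)}{x^{p}} \right\rfloor \left\lfloor \Delta_\beta(x) + \frac{\delta_{p-1}(x)}{N_p(x)} \right\rfloor}{x^{\beta}} \right\rfloor$ (6)

$= \left\lfloor \frac{\left\lfloor \frac{U(x)}{x^{p}} \right\rfloor \left\lfloor \frac{x^{p+\beta}}{N(x)} \right\rfloor}{x^{\beta}} \right\rfloor = \hat{Q}(x)$ (7)

□

There is no need to choose $\beta > \alpha$ because this would require more computations
than when choosing $\beta = \alpha$, as $R(x)$ will then be larger.

The previous quotient computation is thus valid for $\mathbb{F}_q[x]$.

In the next section, we will use our method in $\mathbb{F}_2[x]$, which will show improvements
in its software (and hardware) implementations. $Q(x)$ is then a binary
polynomial of degree $\alpha$ requiring $\alpha + 1$ bits for its binary representation.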
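The quotient estimate of equation 1 can be checked numerically in the binary case. The following is a minimal sketch restricted to $\mathbb{F}_2[x]$ (so $q = 2$), with polynomials stored as Python integers; the names `clmul`, `poly_divmod` and `estimated_quotient` are assumptions for this example. The estimate is computed as $\lfloor T(x)R(x)/x^{\beta} \rfloor$ with $T(x) = \lfloor U(x)/x^{p} \rfloor$ and $R(x) = \lfloor x^{p+\beta}/N(x) \rfloor$, and it is expected to match the true quotient whenever $\beta \ge \alpha$.

```python
def clmul(a, b):
    """Carry-less product of two polynomials over GF(2) stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_divmod(u, n):
    """Schoolbook (left to right) division in GF(2)[x]; n must be non-zero."""
    q = 0
    dn = n.bit_length() - 1
    while u.bit_length() - 1 >= dn:
        shift = u.bit_length() - 1 - dn
        q ^= 1 << shift
        u ^= n << shift
    return q, u


def estimated_quotient(u, n, beta):
    """Quotient estimate floor(T * R / x^beta) in GF(2)[x]."""
    p = n.bit_length() - 1                     # degree of N
    r = poly_divmod(1 << (p + beta), n)[0]     # R(x), precomputable for fixed N
    t = u >> p                                 # T(x) = floor(U / x^p)
    return clmul(t, r) >> beta
```

In this sketch the two divisions by powers of $x$ are plain right shifts, which is why only one carry-less multiplication remains once $R(x)$ has been precomputed.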



3 Modular Multiplication in $\mathbb{F}_2[x]$

In $\mathbb{F}_2[x]$, the coefficients $n_i$ of the polynomial $N(x) = n_p x^p + n_{p-1} x^{p-1} + \cdots + n_1 x + n_0$
are either ‘0’ or ‘1’. This gives a binary representation of the polynomials
in $\mathbb{F}_2[x]$, the upper bits of the representation being the upper coefficients of the
polynomial. For example, the polynomial $x^5 + x^3 + 1$ can be represented as the
binary number ‘101001’.

The modular multiplication in $\mathbb{F}_2[x]$ is one of the most important operations
in elliptic curve cryptography over $GF(2^p)$. We will show that one can obtain,
with the method described in the previous section, performance similar to
that of Montgomery’s modular multiplication in $GF(2^p)$ [1].

From now on, we will suppose that the polynomial modular multiplication
is implemented on a t-bit architecture (e.g. 32-bit). The modular multiplication
$A(x)B(x) \bmod N(x)$ can be written as a sum of modular products of words of
the first operand by the second operand:

$\sum_{i=0}^{p_A-1} A_i(x)B(x)x^{it} \bmod N(x) = U(x) \bmod N(x)$ (8)

where $A_i(x)$ is a polynomial of degree $t - 1$ such that $A(x) = \sum_{i=0}^{p_A-1} A_i(x)x^{it}$ and

where pA = |"xl degree of A{x). 

Let’s first recall some characteristics of polynomial computations in
$\mathbb{F}_2[x]$:

• The product of a polynomial of degree $t - 1$ (which can be represented as a
t-bit binary vector) by a polynomial of degree $n - 1$ is a polynomial of degree
$n + t - 2$ represented as an $(n + t - 1)$-bit vector. Over integers, however, the
result of a product of a t-bit by an n-bit integer is an $(n + t)$-bit number.

• The result of a polynomial addition of two polynomials of degree $p$ is a
polynomial of degree at most $p$ (same number of bits in its binary representation).
Over integers, the result may have one more bit in its binary representation
because of carry propagation.

• The “modulo” operation (remainder of the division between two polynomials)
gives a polynomial whose degree is at least one less than that of the modulus. This
means that the binary vector representing the remainder always has at least one bit
less than that of the modulus. Over integers, the remainder is smaller
than the modulus but can still have the same number of bits in its binary
representation.

Given equation 8, we can now evaluate how the size of the quotient $Q(x)$ (equation 1)
changes. To reduce the required memory and minimize accesses to it,
equation 8 can be computed by interleaving the multiplication (from the highest
index of $A_i$ to the smallest one) with the reduction (by $N(x)$) as shown in figure 1.

  1   U(x) = 0
  2   for i = p_A − 1 down to 0
  3       U(x) = U(x)·x^t ⊕ A_i(x)B(x)
  4       Q(x) = ⌊U(x)/N(x)⌋
  5       U(x) = U(x) ⊕ Q(x)N(x)        [U(x) = U(x) mod N(x)]
  6   end for i

Fig. 1. Modular multiplication in $\mathbb{F}_2[x]$.

Using the above characteristics of computations in $\mathbb{F}_2[x]$, the temporary
polynomial $U(x)$ has, at every stage shown in figure 1, a maximum
degree of $t + p_N - 1$. This means that when computing the quotient $Q(x)$ as in
equation 1, $\alpha$ has to be replaced by $t - 1$. The corresponding $T(x)$ in equation
1 will then be of degree $t - 1$ and $R(x)$ in equation 1 of degree $t - 1$. A binary
vector representation in t bits will thus perfectly match the computations on a
t-bit architecture (CPU).

This also means that our method would require “standard” computations as
in Montgomery’s method described in [1]. In Montgomery’s multiplication, a t-bit
polynomial multiplication (with polynomials of degree $t - 1$) and a division by
$x^t$ are required. In the present case, a t-bit polynomial multiplication and a division
by $x^{t-1}$ are needed. The remaining part of the computations can be the same in
both methods but the way the computations are made is different (we start
here from the most significant word $A_{p_A-1}$ instead of $A_0$ in Montgomery’s
method).
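The word-by-word procedure of figure 1, combined with the quotient estimate of equation 1, can be sketched as follows. This is an illustrative software model and not the implementation of the paper: polynomials are Python integers, the word size `t` defaults to 8 bits for readability, and the names `clmul`, `poly_divmod` and `barrett_modmul` are assumptions for this example. The sketch assumes that $B(x)$ is already reduced, so that $U(x)$ never exceeds degree $p + t - 1$ and $\beta = t - 1$ suffices.

```python
def clmul(a, b):
    """Carry-less product of two polynomials over GF(2) stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_divmod(u, n):
    """Schoolbook division in GF(2)[x]; n must be non-zero."""
    q = 0
    dn = n.bit_length() - 1
    while u.bit_length() - 1 >= dn:
        shift = u.bit_length() - 1 - dn
        q ^= 1 << shift
        u ^= n << shift
    return q, u


def barrett_modmul(a, b, n, t=8):
    """A(x)B(x) mod N(x), processing A from its most significant t-bit word."""
    p = n.bit_length() - 1                       # degree of N
    beta = t - 1
    r_const = poly_divmod(1 << (p + beta), n)[0]  # R(x), fixed for a given N
    words = max(1, -(-a.bit_length() // t))       # number of t-bit words of A
    mask = (1 << t) - 1
    u = 0
    for i in range(words - 1, -1, -1):
        a_i = (a >> (i * t)) & mask
        u = (u << t) ^ clmul(a_i, b)              # line 3 of figure 1
        q = clmul(u >> p, r_const) >> beta        # line 4, via equation 1
        u ^= clmul(q, n)                          # line 5
    return u
```

For instance, `barrett_modmul(0x57, 0x83, 0x11B, t=4)` multiplies two bytes modulo $x^8 + x^4 + x^3 + x + 1$ using two 4-bit words of the first operand and returns `0xC1`.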



1 


U{x) = Ap^_i{x)B{x) 


2 


for i = PA — 2 down to 0 


3 


Q(x) = [(T(x)i?(x))/x*"iJ 


4 


1 

1 

O 

o 

II 

£ 


5 


U(x) = (U(x) © Q(x)Nj(x))x*^^~*~^^ © Ai(x)Bj(x)x*^ 


6 


end for j 


7 


end for i 


8 


Q = \U{x)/N{x)\ 


9 


U{x) = U{x) (B Q{x)N{x) 



Fig. 2. Interleaved modular multiplication in $\mathbb{F}_2[x]$.

It is possible to further reduce the memory accesses on $U(x)$. This is very
important since memory accesses are often an important limiting factor in terms
of execution time, notably in smart cards. To do so, we take the first computation
$A_{p_A-1}(x)B(x)$ out of the i loop as shown in figure 2, allowing the i loop to start
with the quotient computation ($Q(x)$) and the two computations of $U(x)$ in lines
(3) and (5) of figure 1 to be merged into line (5) of figure 2.
Such a computation requires a final reduction outside the i loop (lines (8)
and (9) in figure 2). The only disadvantage of interleaving the multiplication
and the reduction phase using a unique j loop is that the numbers of $N_j(x)$
and $B_j(x)$ must be identical ($p_B = p_N$), meaning that if, for example, the degree of
$B(x)$ is smaller than that of $N(x)$, it should be padded on the left with zeroes when
storing it in the B[j]’s. This nevertheless does not influence the speed of practical
implementations since $B(x)$ is normally considered to have the same size as $N(x)$.

3.1 Software Implementations on a t-Bit Processor Architecture

For the sake of clarity, when comparing with existing implementations, we
will suppose that $p = p_A = p_N$. The detailed implementation of the modular
multiplication is shown in figure 4.

In this figure, “(Hi, Lo)” is a 2t-bit register (such a register is common on
RISC architectures providing a t × t-bit multiplication with a 2t-bit result) which
is the concatenation of registers Hi and Lo (both t-bit registers), where Hi is the
upper part (most significant bits) of that register and Lo the lower part. The
expression “(Hi, Lo) ≫ t” means that this virtual register is shifted right by t bits.
In other words, the result is Hi = 0 and Lo = Hi, the old value of Lo being
discarded.

In figure 4, ‘⊕’ represents a bitwise XOR operation and ‘⊗’ means a
polynomial multiplication in $\mathbb{F}_2[x]$ where the polynomials have at most a degree
of $t - 1$. In other words, a computation like (Hi, Lo) ⊕= A ⊗ B is simply a multiply
and accumulate calculation, just like the one present on most RISC processors
and DSPs, but with the internal carries in multiplications and additions being
disabled. An algorithmic representation of this computation is shown in figure
3. Such a calculation is already implemented as an instruction in some high-end
smart card microprocessors to improve elliptic curve computations in $GF(2^p)$.



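    for i = 0 to t − 1
        $(\mathrm{Hi}, \mathrm{Lo}) = (\mathrm{Hi}, \mathrm{Lo}) \oplus ((A \cdot ((B \gg i)\ \mathrm{AND}\ 1)) \ll i)$
        (here $(B \gg i)\ \mathrm{AND}\ 1$ is the i-th bit of B)
    end for i

Fig. 3. Simple program simulating the computation of $(\mathrm{Hi}, \mathrm{Lo}) \mathrel{\oplus}= A \otimes B$.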
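
The loop of figure 3 can be mimicked in software to check small cases by hand. The following Python sketch is only an illustration under stated assumptions: the function name `clmac`, the explicit word size parameter `t`, and the use of Python integers as registers are hypothetical and do not come from the original text.

```python
def clmac(hi, lo, a, b, t):
    """Carry-less multiply-accumulate: (Hi, Lo) ^= a (x) b over GF(2)[x].

    hi, lo, a, b are t-bit words; bit i of a word is the coefficient of x^i.
    (Hypothetical helper, mirroring the loop of figure 3.)
    """
    mask = (1 << t) - 1
    acc = (hi << t) | lo          # the 2t-bit virtual register (Hi, Lo)
    for i in range(t):
        if (b >> i) & 1:          # i-th bit of B
            acc ^= a << i         # add A * x^i, no carries
    return (acc >> t) & mask, acc & mask


# (x^6 + x^4 + x^2 + x + 1) * (x^7 + x + 1) with 8-bit words,
# accumulated into a cleared register.
hi, lo = clmac(0, 0, 0x57, 0x83, 8)
print(hex(hi), hex(lo))
```

Accumulating the same product a second time clears the register again, since addition in $\mathbb{F}_2[x]$ is its own inverse.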

Ideally, $U_{sup}$ in lines (0) and (0) of figure 4 can be replaced by Lo, but only if
the Most Significant Bit (MSB) of N[p − 1] corresponds to the most significant
coefficient of $N(x)$. Otherwise $U_{sup} = (\mathrm{Lo} \ll k) \oplus (\mathrm{Rs} \gg (t - k))$, where $k$ is
the shift value that would be needed to align the most significant coefficient of
$N(x)$ stored in N[p − 1] with the MSB of N[p − 1].

Figure 4 also shows the number of multiply and accumulate instructions
without carries (column ‘#$\otimes$’) and the number of memory accesses. In comparison
with the paper of Koç and Acar [1], we have exactly the same number of
multiplications without carries but without the additional XOR operations. To be
precise, most of the XORs are included in our multiply and accumulate operation.



[Figure 4: the pseudo-code listing is not recoverable from the source. Its TOTAL
row lists $2p^2 + p$, $3p^2 + p$ and $p^2 + p$ (columns #$\otimes$, #LOAD and #STORE).]

Fig. 4. Interleaved modular multiplication in $\mathbb{F}_2[x]$.

Improved implementation. The pseudo-code in figure 4 can be compacted further
by computing $A_iB$ interleaved with $QN$ in the reverse order, as shown in
figure 5 (computing first $A_iB[p-1]$ and $QN[p-1]$). This is possible because
there is no carry propagation when working in $\mathbb{F}_2[x]$ (in contrast to
computations over integers). Aligning the upper coefficient of $N(x)$ with the upper
bit of N[p − 1] when storing $N(x)$ in the N[j]'s (before starting the computations)
also simplifies the computations, as shown in figure 5. This only requires
one additional adaptation (a final right shift) of the final result if the modulus is
constant during a whole set of modular multiplications. This is exactly the case
with most cryptographic algorithms (e.g. ECDSA over GF($2^m$) [4]).







[Figure 5: the pseudo-code listing is not recoverable from the source. Its TOTAL
row lists $2p^2 + p$, $3p^2 + p$ and $p^2 + p$ (columns #$\otimes$, #LOAD and #STORE).]

Fig. 5. Interleaved modular multiplication with internal loop starting in the reverse
order.

In this case (figure 5), $U_{sup}$ is simply $((U[p-1], U[p-2]) \oplus A[i] \otimes B[p-1]) \gg (t-1)$.
Indeed, $A[i] \otimes B[p-2]$ has no influence on the required upper part
of $U(x)$, as it only affects the $(t-1)$ first bits of $((U[p-1], U[p-2]) \oplus A[i] \otimes B[p-1])$
and there is no carry propagation. Another consequence and advantage
of such a computation is that there is no longer any need to compute $A_{p-1}(x)B(x)$
in advance. No additional final reduction by $Q(x)N(x)$ is necessary, so that
lines (1) to (0) and (0) to (0) of figure 4 are no longer necessary in figure 5.

As shown in figure 5, the global number of operations is identical to the 
one in figure 4, but the code size is smaller. Except for the last reason, the 
choice between the two implementations will only depend on the (processor’s) 
architecture. 
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3.2 Comparison with Montgomery’s Modular Multiplication 

In figure 6, Montgomery's modular multiplication is implemented in the same
way as our method. The main difference between our implementation and that of
Koç and Acar [1] is the merging of the multiplication and the reduction
phases of the algorithm to reduce the memory accesses. The number of memory
accesses is smaller in our case than the one required by Montgomery's
method as described by Koç and Acar. Their method needs $(6s^2 - s)$ load
and $(3s^2 + 2s + 1)$ store operations, as given in table 2 of their paper.

As shown in figures 4, 5 and 6, our method is similar to Montgomery's in
terms of the number of multiply and accumulate operations without carries and
the number of memory accesses.

Fig. 6. Interleaved Montgomery's modular multiplication in $\mathbb{F}_2[x]$.



However, a possible inconvenience of our method (for a software implementation
on a general purpose processor), as compared to Montgomery's, would
be the slower “extraction” of $U_{sup}$ from the intermediate values of $U(x)$ as well
as the right shift by $(t-1)$ when computing $Q$. This disadvantage can be
handled simply, at a very minimal cost, in a hardware implementation.

The main disadvantage of Montgomery's method is not visible in figure 6.
Indeed, our method computes exactly $A(x)B(x) \bmod N(x)$, which is not the case for
Montgomery's method, which computes $A(x)B(x)x^{-p} \bmod N(x)$ [1]. This last
computation is mostly not desired. This is why Montgomery's method often requires
substantial additional code to deal with that computation (e.g. replace $A(x)$ by
$A(x)x^{p} \bmod N(x)$ and $B(x)$ by $B(x)x^{p} \bmod N(x)$ before the computations and
then finish the computations by multiplying the final result by ‘1’ using
Montgomery's method). These computations penalize Montgomery's method in terms
of code size (which can be critical in a smart card context) and may complicate the
use of Montgomery's method outside the scope of modular exponentiation
computations (e.g. for a single modular multiplication).
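
The extra conversions described above can be made concrete with a small bit-serial model. The sketch below is only an illustration and not the word-level algorithm of figure 6: it assumes that polynomials over $\mathbb{F}_2$ are packed into Python integers, that the normalisation factor is $x^k$ with $k = \deg N$, and that $N(x)$ has constant term 1. All function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two polynomials over GF(2) packed in integers."""
    r, i = 0, 0
    while b >> i:
        if (b >> i) & 1:
            r ^= a << i
        i += 1
    return r


def polymod(a, n):
    """Remainder of a(x) divided by n(x) over GF(2)."""
    dn = n.bit_length() - 1
    while a.bit_length() - 1 >= dn:
        a ^= n << (a.bit_length() - 1 - dn)
    return a


def mont_mul(a, b, n, k):
    """Bit-serial Montgomery product a*b*x^(-k) mod n, with deg n = k, n(0) = 1."""
    c = 0
    for i in range(k):
        if (a >> i) & 1:
            c ^= b
        if c & 1:          # make c divisible by x by adding n
            c ^= n
        c >>= 1            # exact division by x
    return c


n, k = 0x11B, 8            # x^8 + x^4 + x^3 + x + 1
a, b = 0x57, 0x83
exact = polymod(clmul(a, b), n)          # direct A*B mod N

# Montgomery route: convert in, multiply, convert out by multiplying by 1.
a_bar = polymod(a << k, n)
b_bar = polymod(b << k, n)
via_montgomery = mont_mul(mont_mul(a_bar, b_bar, n, k), 1, n, k)
print(hex(exact), hex(via_montgomery))
```

A single call `mont_mul(a, b, n, k)` on unconverted operands returns the product scaled by $x^{-k}$, which is why the conversion steps are needed when only one multiplication is wanted.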



3.3 Speed Comparisons on a Real Implementation 

Table 1 shows the results, in terms of clock cycles, obtained by both Mont- 
gomery’s multiplication (figure 6) and the two versions of our method described 
in figures 4 and 5. 





Algorithm                       256-bit multiplication   512-bit multiplication
Algorithm in fig. 4             910                      3230
Compact algorithm in fig. 5     812                      3028
Montgomery's method (fig. 6)    756                      2916

Table 1. Algorithms' speed in clock cycles on a Montgomery-optimized processor.



Measurements were done on a simulator of a 32-bit processor (usable in smart
cards) using the modified multiply accumulate instruction (without internal
carries) described in section 3.1. This processor was designed to speed up
Montgomery's multiplications. This explains why, in table 1, Montgomery's
method still has an advantage. As explained before, this is only due to the small
additional computations required for computing $Q(x)$ in our implementation.
Indeed, 7 additional clock cycles are required for each $Q(x)$ computation
(equivalent to line (8) in figure 5) compared with what is done in the Montgomery
implementation (line (9) in figure 6). This matches table 1: with 32-bit words,
$812 - 756 = 56 = 7 \times 8$ for 256 bits and $3028 - 2916 = 112 = 7 \times 16$ for 512 bits.
However, similar results for both implementations can be obtained by very
slightly modifying a few processor instructions.

The comparison made here only involves the core modular multiplication
itself. As explained in section 3.2, the fact that Montgomery's method computes
$A(x)B(x)x^{-p} \bmod N(x)$ instead of the exact value $A(x)B(x) \bmod N(x)$ computed
by our method can also deeply influence the choice between one algorithm and the
other.
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4 Conclusions 

We have first extended the generalized Barrett's modular reduction to $\mathbb{F}_q[x]$. We
then described an efficient way to implement fast “left to right” modular
multiplication in $\mathbb{F}_2[x]$ which is at least as efficient as the best known methods,
namely Montgomery's multiplication [1]. Furthermore, our method has the advantage
of computing the modular multiplication without the inconvenience of a
normalization as needed in Montgomery's method. This makes our method particularly
attractive for smart cards and hardware implementations. A way of reducing the
memory accesses in both methods has been described. Both methods can be
efficiently implemented by having a multiply and accumulate instruction without
internal carries to perform fast competitive software implementations of elliptic
curve crypto-systems in GF($2^m$).
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Abstract. A novel technique for computing a 2n-bit modular multipli- 
cation using n-bit arithmetic was introduced at CHES 2002 by Fischer 
and Seifert. Their technique makes use of an Euclidean division based 
instruction returning not only the remainder but also the integer quotient
resulting from a modular multiplication, i.e., on input $x$, $y$ and $z$,
both $\lfloor xy/z \rfloor$ and $xy \bmod z$ are returned. A second algorithm making
use of a special modular ‘multiply-and-accumulate’ instruction was also
proposed.

In this paper, we improve on these algorithms and propose more ad- 
vanced computational strategies with fewer calls to these basic opera- 
tions, bringing in a speed-up factor up to 57%. Besides, when Euclidean 
multiplications themselves have to be emulated in software, we propose a 
specific modular multiplication based algorithm which surpasses original 
algorithms in performance by 71%. 

Keywords: Modular multiplication, crypto-processors, embedded cryp- 
tographic software, efficient implementations, RSA. 



1 Introduction 

When a cryptographic coprocessor is inherently limited to handle numbers of a 
specific bitsize n, performing modular arithmetic operations over larger operands 
turns out to be an intricate implementation problem. One may think of natural 
and simple solutions like programming multi-precision algorithms such as those 
of Montgomery [5], Barrett [3], Quisquater [8] or Walter [10]. These algorithms 
as well as others, however, require processing data blocks via smaller operations 
that may not be supported by the underlying hardware architecture. A typical 
example lies in the regular $(n \times n)$-bit integer multiplication (with a 2n-bit result),
which may not be directly available on a crypto-processor like Infineon's ACE,
where only n-bit modular operations (for n up to 1100) are programmable. To
remedy this, a conventional trick tells us to take a blocksize n < 1100/2 (say n =
512), so that integer multiplications result from multiplications modulo $2^{2n}$.
Adopting this strategy, a Montgomery-based implementation for a single 2048-bit
multiplication would cost no less than forty 512-bit integer multiplications,
an unacceptable performance. So implementing a 2048-bit RSA while sustaining
higher expectations in terms of execution speed is not a straightforward task on
this platform. Given that context, one has to devise more specific techniques to
emulate modular arithmetic operations over operands of larger sizes.

Surprisingly enough, little is known about software strategies that would 
overcome this hardware-originated length limitation in a very efficient way. Two 
different techniques, however, have appeared in the literature recently. In [6], 
Paillier presents a 2n-bit modular multiplication emulated with 8 → 6 calls to a
modular multiplier of bitsize n (plus other, negligible operations). Paillier’s algo- 
rithm, inspired by Montgomery’s technique [5], strongly relies on Residue Num- 
ber Systems (RNS) [7,9] for representing data and performing partial operations 
on them. It simplifies earlier, more intricate approaches making use of mixed base 
representations [1,2]. The efficiency of this system is due to the use of fast base 
extensions in connection with a specific choice for the RNS base, a choice which 
also ensures that the result of the double-size multiplication is returned under a 
representation compatible with the input operands themselves, thereby allowing 
repeated invocations of the algorithm. Unfortunately, its Montgomery-like style 
forces one to precompute a modulus-dependent constant prior to multiplying 
any data. 

More recently, in [4], Fischer and Seifert suppress the need for precomputed
constants: 2n-bit operands are handled through a classical radix representation
with base $2^n$, the new technique outputting a result under the same
representation. Independently, in this work, Fischer and Seifert replace the basic
operation with a Euclidean multiplication, i.e., an operation that simultaneously
returns both the quotient and remainder of the division $xy \div z$, given arbitrary
n-bit integers $x$, $y$ and $z$. The motivation for this stems from the ease of
integrating such an operation in a hardware architecture which already supports
modular multiplications. On most architectures indeed, the arithmetic units involved
in the execution of a modular reduction could be easily enriched to simultaneously
output quotient bits with extremely moderate extra cost.

In this paper, we improve on Fischer and Seifert's algorithms and propose
more advanced computational strategies with fewer calls to the Euclidean
multiplication. Our improved algorithms use $2^n$-radix representations of numbers in
the spirit of [4]. We also show that adapting the choice of the radix base
according to the modulus may further speed up our technique. This modification can be
carried out while maintaining inputs and outputs under the same arithmetic
format, thereby making it possible to iterate executions. In the most favorable case,
we emulate a double-size modular multiplication with no more than 3 Euclidean
multiplications, which leads to a speedup factor of $\approx 57\%$. In addition,
when Euclidean multiplications themselves must be emulated in software from
modular multiplications, we show how to use these directly without referring
to RNS-based approaches [1, 2, 7, 9]. More precisely, we propose a simple
alternative to these works which keeps numbers under a radix representation and runs
as fast as 2 Euclidean multiplications. This accelerates the original algorithms by
$\approx 71\%$. We recall that, whatever the computational strategy, it is agreed
that n-bit linear operations such as signed additions, subtractions, xors,
conditional branchings and so forth, are always available and that their respective
running times remain negligible in comparison with operations of multiplicative
nature. As usual, we consider these as being virtually free operations throughout
the paper.

The rest of this paper is organized as follows. In the next section, we review 
the technique introduced by Fischer and Seifert for emulating a 2n-bit modular 
multiplication and show how to improve it in Section 3. Then, in Section 4, we 
investigate the influence of data representations on the performances of our algo- 
rithms. In Section 5, we detail an implementation of an Euclidean multiplication. 
Section 6 describes a specific strategy for cases when Euclidean multiplications 
are emulated in software. Finally, we summarize and compare our results in 
Section 7. 

2 Fischer and Seifert’s Algorithms 

Fischer and Seifert’s technique [4] relies on the two basic instructions 

\[ \mathrm{MultModDiv}(x, y, z) = \left( \lfloor (x \cdot y)/z \rfloor,\ (x \cdot y) \bmod z \right) \tag{1} \]

and

\[ \mathrm{MultModDivInit}(x, y, t, z) = \left( \lfloor (x \cdot y + t \cdot 2^n)/z \rfloor,\ (x \cdot y + t \cdot 2^n) \bmod z \right), \tag{2} \]

where $x, y, t, z$ are n-bit integers. It is implicitly required in [4] that operands $x, y$
and $t$ can be negative, i.e., that the processor is able to handle them whatever
their sign through a signed representation without affecting computation results.
In fact, these two instructions should, by extension, work for any non-reduced
inputs $x, y, t$, namely whenever $|x| > z$ for instance, provided that $\lceil |x|/z \rceil$
remains an extremely small value. Subsequent hardware or software corrections
are neglected in the description of all algorithms, as proposed in [4].

The algorithms originally proposed by Fischer and Seifert, which we denote
by FS1 and FS2, are depicted in Fig. 1 and Fig. 2, respectively. We refer the
reader to [4] for proofs of correctness.
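
As a reading aid, the two instructions of equations (1) and (2) can be modelled with arbitrary-precision integers. This is a minimal sketch, not a description of any coprocessor: the function names and the explicit `n` parameter are assumptions made for illustration. Python's `divmod` uses floor division, so signed operands behave as the definitions require.

```python
def mult_mod_div(x, y, z):
    """Model of MultModDiv: (floor(x*y / z), x*y mod z)."""
    return divmod(x * y, z)


def mult_mod_div_init(x, y, t, z, n):
    """Model of MultModDivInit: same, applied to x*y + t*2^n."""
    return divmod(x * y + t * (1 << n), z)


# Signed inputs are allowed: the pair (q, r) always satisfies x*y = q*z + r.
q, r = mult_mod_div(-7, 5, 6)
print(q, r, q * 6 + r)
```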

3 Improved Algorithms 

Our idea consists in rewriting the modular multiplication in terms of manipulations
over half-size operands. This is reminiscent of Karatsuba's famous method
which we recall here for the sake of completeness.

Input: 2n-bit integers $A = A_1 2^n + A_0$, $B = B_1 2^n + B_0$, $N = N_1 2^n + N_0$
Output: $AB \pmod N$
Cost: 7 MultModDiv
  $(Q^{(1)}, R^{(1)}) = \mathrm{MultModDiv}(B_1, 2^n, N_1)$
  $(Q^{(2)}, R^{(2)}) = \mathrm{MultModDiv}(Q^{(1)}, N_0, 2^n)$
  $(Q^{(3)}, R^{(3)}) = \mathrm{MultModDiv}(A_1, R^{(1)} - Q^{(2)} + B_0, N_1)$
  $(Q^{(4)}, R^{(4)}) = \mathrm{MultModDiv}(A_0, B_1, N_1)$
  $(Q^{(5)}, R^{(5)}) = \mathrm{MultModDiv}(Q^{(3)} + Q^{(4)}, N_0, 2^n)$
  $(Q^{(6)}, R^{(6)}) = \mathrm{MultModDiv}(A_1, R^{(2)}, 2^n)$
  $(Q^{(7)}, R^{(7)}) = \mathrm{MultModDiv}(A_0, B_0, 2^n)$
  Return $(R^{(3)} + R^{(4)} - Q^{(5)} - Q^{(6)} + Q^{(7)})2^n + (R^{(7)} - R^{(6)} - R^{(5)})$

Fig. 1. Fischer-Seifert's modular multiplication algorithm FS1

Input: 2n-bit integers $A = A_1 2^n + A_0$, $B = B_1 2^n + B_0$, $N = N_1 2^n + N_0$
Output: $AB \pmod N$
Cost: 5 MultModDiv + 1 MultModDivInit
  $(Q^{(1)}, R^{(1)}) = \mathrm{MultModDiv}(A_1, B_1, N_1)$
  $(Q^{(2)}, R^{(2)}) = \mathrm{MultModDivInit}(N_0, -Q^{(1)}, R^{(1)}, N_1)$
  $(Q^{(3)}, R^{(3)}) = \mathrm{MultModDiv}(A_1, B_0, N_1)$
  $(Q^{(4)}, R^{(4)}) = \mathrm{MultModDiv}(A_0, B_1, N_1)$
  $(Q^{(5)}, R^{(5)}) = \mathrm{MultModDiv}(A_0, B_0, 2^n)$
  $(Q^{(6)}, R^{(6)}) = \mathrm{MultModDiv}(Q^{(2)} + Q^{(3)} + Q^{(4)}, N_0, 2^n)$
  Return $(R^{(2)} + R^{(3)} + R^{(4)} + Q^{(5)} - Q^{(6)})2^n + (R^{(5)} - R^{(6)})$

Fig. 2. Fischer-Seifert's modular multiplication algorithm FS2

Lemma 1 (Karatsuba). If $A = A_1 2^n + A_0$ and $B = B_1 2^n + B_0$ then

\[ AB = 2^n(2^n - 1)A_1B_1 + 2^n(A_1 + A_0)(B_1 + B_0) - (2^n - 1)A_0B_0 . \]
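
For readers who want to check the identity of Lemma 1, the following short expansion (added here as a worked verification, writing $Z = 2^n$) shows how the cross terms appear and the surplus terms cancel:

\begin{align*}
Z(Z-1)A_1B_1 &+ Z(A_1 + A_0)(B_1 + B_0) - (Z-1)A_0B_0 \\
 &= Z^2A_1B_1 - ZA_1B_1 + ZA_1B_1 + Z(A_1B_0 + A_0B_1) + ZA_0B_0 - ZA_0B_0 + A_0B_0 \\
 &= Z^2A_1B_1 + Z(A_1B_0 + A_0B_1) + A_0B_0 \\
 &= (A_1Z + A_0)(B_1Z + B_0) .
\end{align*}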

Our first algorithm only makes use of Fischer and Seifert’s MultModDiv in- 
struction while our second algorithm also employs the MultModDivInit instruc- 
tion. 



3.1 Using MultModDiv Instructions Only 

In this section, we eliminate a MultModDiv instruction in Fischer-Seifert's
technique. We state:

Theorem 1. Given arbitrary 2n-bit integers $N$ and $A, B < N$, the 2n-bit integer
$AB \bmod N$ can be computed with at most six n-bit MultModDiv instructions.

Our algorithm, denoted A1, is described hereafter in Fig. 3.

Input: 2n-bit integers $A = A_1 2^n + A_0$, $B = B_1 2^n + B_0$, $N = N_1 2^n + N_0$
Output: $AB \pmod N$
Cost: 6 MultModDiv
  $(Q^{(1)}, R^{(1)}) = \mathrm{MultModDiv}(A_1, B_1, N_1)$
  $(Q^{(2)}, R^{(2)}) = \mathrm{MultModDiv}(Q^{(1)}, N_0, 2^n)$
  $(Q^{(3)}, R^{(3)}) = \mathrm{MultModDiv}(A_1 + A_0, B_1 + B_0, 2^n - 1)$
  $(Q^{(4)}, R^{(4)}) = \mathrm{MultModDiv}(A_0, B_0, 2^n)$
  $(Q^{(5)}, R^{(5)}) = \mathrm{MultModDiv}(2^n - 1, R^{(1)} + Q^{(3)} - Q^{(2)} - Q^{(4)}, N_1)$
  $(Q^{(6)}, R^{(6)}) = \mathrm{MultModDiv}(Q^{(5)}, N_0, 2^n)$
  Return $(R^{(3)} + R^{(5)} - Q^{(6)} - R^{(2)} - R^{(4)})2^n + (R^{(2)} + R^{(4)} - R^{(6)})$

Fig. 3. Our improved algorithm A1 for double-size modular multiplication

Proof (of correctness for A1). For convenience, we write $Z = 2^n$ and denote by
$\equiv_N$ the equivalences modulo $N$. Then, rewriting Lemma 1 gives

\[ AB = Z(Z-1)A_1B_1 + Z(A_1 + A_0)(B_1 + B_0) - (Z-1)A_0B_0 . \]

Moreover, noticing that $N_1Z \equiv_N -N_0$, we get

\begin{align*}
Z(Z-1)A_1B_1 &\equiv_N Z(Z-1)(Q^{(1)}N_1 + R^{(1)}) \\
 &\equiv_N -(Z-1)(Q^{(1)}N_0) + Z(Z-1)R^{(1)} \\
 &\equiv_N -(Z-1)(Q^{(2)}Z + R^{(2)}) + Z(Z-1)R^{(1)} \\
 &\equiv_N Z(Z-1)(R^{(1)} - Q^{(2)}) - (Z-1)R^{(2)} ,
\end{align*}

\[ Z(A_1 + A_0)(B_1 + B_0) = Z((Z-1)Q^{(3)} + R^{(3)}) = Z(Z-1)Q^{(3)} + ZR^{(3)} \]

and

\[ (Z-1)A_0B_0 = (Z-1)(ZQ^{(4)} + R^{(4)}) = Z(Z-1)Q^{(4)} + (Z-1)R^{(4)} . \]

Hence, we have

\begin{align*}
AB &\equiv_N Z(Z-1)(R^{(1)} + Q^{(3)} - Q^{(2)} - Q^{(4)}) + ZR^{(3)} - (Z-1)(R^{(2)} + R^{(4)}) \\
 &\equiv_N Z(Q^{(5)}N_1 + R^{(5)} + R^{(3)}) - (Z-1)(R^{(2)} + R^{(4)}) \\
 &\equiv_N -Q^{(5)}N_0 + Z(R^{(3)} + R^{(5)}) - (Z-1)(R^{(2)} + R^{(4)}) \\
 &\equiv_N -(Q^{(6)}Z + R^{(6)}) + Z(R^{(3)} + R^{(5)}) - (Z-1)(R^{(2)} + R^{(4)}) \\
 &\equiv_N (R^{(3)} + R^{(5)} - Q^{(6)} - R^{(2)} - R^{(4)})Z + (R^{(2)} + R^{(4)} - R^{(6)}) ,
\end{align*}

which proves the correctness of Algorithm A1 and of Theorem 1. □
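
The six calls of Algorithm A1 can be traced numerically. The following Python sketch is an illustration under assumptions: MultModDiv is modelled by `divmod` on arbitrary-precision integers, the names `mult_mod_div` and `a1` are hypothetical, and the result is left unreduced (the final small correction is neglected, as in the text), so the check compares residues modulo $N$.

```python
import random


def mult_mod_div(x, y, z):
    """Model of MultModDiv: (floor(x*y / z), x*y mod z)."""
    return divmod(x * y, z)


def a1(A, B, N, n):
    """Value congruent to A*B mod N using six MultModDiv calls (Algorithm A1)."""
    Z = 1 << n
    A1, A0 = divmod(A, Z)
    B1, B0 = divmod(B, Z)
    N1, N0 = divmod(N, Z)
    q1, r1 = mult_mod_div(A1, B1, N1)
    q2, r2 = mult_mod_div(q1, N0, Z)
    q3, r3 = mult_mod_div(A1 + A0, B1 + B0, Z - 1)
    q4, r4 = mult_mod_div(A0, B0, Z)
    q5, r5 = mult_mod_div(Z - 1, r1 + q3 - q2 - q4, N1)
    q6, r6 = mult_mod_div(q5, N0, Z)
    # Unreduced representative; a few additions or subtractions of N remain.
    return (r3 + r5 - q6 - r2 - r4) * Z + (r2 + r4 - r6)


random.seed(1)
n = 16
N = random.randrange(1 << (2 * n - 1), 1 << (2 * n))  # top bit set, so N1 != 0
A, B = random.randrange(N), random.randrange(N)
print(a1(A, B, N, n) % N == (A * B) % N)
```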
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3.2 Using MultModDiv and MultModDivInit Instructions 

Here again, we invoke Lemma 1 and improve Fischer and Seifert’s double size 
multiplier FS2. Formally, we state: 

Theorem 2. Given arbitrary 2n-bit integers $N$ and $A, B < N$, the 2n-bit integer
$AB \bmod N$ can be computed with at most four n-bit MultModDiv instructions
and one n-bit MultModDivInit instruction.

Our algorithm, denoted A2, is described below in Fig. 4.

Input: 2n-bit integers $A = A_1 2^n + A_0$, $B = B_1 2^n + B_0$, $N = N_1 2^n + N_0$
Output: $AB \pmod N$
Cost: 4 MultModDiv + 1 MultModDivInit
  $(Q^{(1)}, R^{(1)}) = \mathrm{MultModDiv}(A_1, B_1, N_1)$
  $(Q^{(2)}, R^{(2)}) = \mathrm{MultModDiv}(A_1 + A_0, B_1 + B_0, 2^n - 1)$
  $(Q^{(3)}, R^{(3)}) = \mathrm{MultModDiv}(A_0, B_0, 2^n)$
  $(Q^{(4)}, R^{(4)}) = \mathrm{MultModDivInit}(Q^{(1)}, N_0, Q^{(3)} - R^{(1)} - Q^{(2)}, N_1)$
  $(Q^{(5)}, R^{(5)}) = \mathrm{MultModDiv}(N_0 + N_1, Q^{(4)}, 2^n)$
  Return $(R^{(2)} + Q^{(5)} - R^{(3)} - R^{(4)})2^n + (R^{(3)} + R^{(4)} + R^{(5)})$

Fig. 4. Improved algorithm A2 for double-size modular multiplication



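Proof (of correctness for A2). As before, we set $Z = 2^n$. We have

\begin{align*}
Z(Z-1)A_1B_1 &\equiv_N Z(Z-1)(Q^{(1)}N_1 + R^{(1)}) \equiv_N (Z-1)(-Q^{(1)}N_0 + R^{(1)}Z) , \\
Z(A_1 + A_0)(B_1 + B_0) &= Z((Z-1)Q^{(2)} + R^{(2)}) = (Z-1)Q^{(2)}Z + ZR^{(2)} , \\
(Z-1)A_0B_0 &= (Z-1)(ZQ^{(3)} + R^{(3)}) = (Z-1)Q^{(3)}Z + (Z-1)R^{(3)} ,
\end{align*}

so that

\begin{align*}
AB &\equiv_N -(Z-1)(Q^{(1)}N_0 + (Q^{(3)} - R^{(1)} - Q^{(2)})Z) + ZR^{(2)} - (Z-1)R^{(3)} \\
 &\equiv_N -(Z-1)(Q^{(4)}N_1 + R^{(4)}) + ZR^{(2)} - (Z-1)R^{(3)} \\
 &\equiv_N (N_0 + N_1)Q^{(4)} + ZR^{(2)} - (Z-1)(R^{(3)} + R^{(4)}) \\
 &\equiv_N (Q^{(5)}Z + R^{(5)}) + ZR^{(2)} - (Z-1)(R^{(3)} + R^{(4)}) \\
 &\equiv_N (R^{(2)} + Q^{(5)} - R^{(3)} - R^{(4)})Z + (R^{(3)} + R^{(4)} + R^{(5)}) ,
\end{align*}

since $N_1(Z-1) \equiv_N -N_0 - N_1$. □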
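
The same kind of numerical check applies to Algorithm A2. In this sketch, which rests on the same assumptions as before (hypothetical function names, `divmod` as a stand-in for the hardware instructions, unreduced output), MultModDivInit receives the signed operand $Q^{(3)} - R^{(1)} - Q^{(2)}$ as its third argument.

```python
import random


def mult_mod_div(x, y, z):
    """Model of MultModDiv: (floor(x*y / z), x*y mod z)."""
    return divmod(x * y, z)


def mult_mod_div_init(x, y, t, z, n):
    """Model of MultModDivInit applied to x*y + t*2^n."""
    return divmod(x * y + t * (1 << n), z)


def a2(A, B, N, n):
    """Value congruent to A*B mod N: 4 MultModDiv + 1 MultModDivInit (A2)."""
    Z = 1 << n
    A1, A0 = divmod(A, Z)
    B1, B0 = divmod(B, Z)
    N1, N0 = divmod(N, Z)
    q1, r1 = mult_mod_div(A1, B1, N1)
    q2, r2 = mult_mod_div(A1 + A0, B1 + B0, Z - 1)
    q3, r3 = mult_mod_div(A0, B0, Z)
    q4, r4 = mult_mod_div_init(q1, N0, q3 - r1 - q2, N1, n)
    q5, r5 = mult_mod_div(N0 + N1, q4, Z)
    return (r2 + q5 - r3 - r4) * Z + (r3 + r4 + r5)


random.seed(2)
n = 16
N = random.randrange(1 << (2 * n - 1), 1 << (2 * n))
A, B = random.randrange(N), random.randrange(N)
print(a2(A, B, N, n) % N == (A * B) % N)
```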



4 Further Improvements Using Specific Representations 

All algorithms considered so far manipulate integers in radix representation with
base $2^n$. We now show how changing that representation may lead to further
cost savings in our algorithms. Although we explicitly describe only a couple
of (modulus-dependent) representations in what follows, there might exist other
ones which would prove just as efficient. In both cases, the idea is simply to
employ a clever representation base derived from the modulus $N$. This computation
is performed prior to the execution of the corresponding double-size modular
multiplication and can be executed once and for all, especially when the
multiplication is invoked repeatedly.

4.1 Down to 5 MultModDiv 

Let $X = \lceil \sqrt{N} \rceil$. Then setting $\alpha = X^2 \bmod N$, we have $\alpha < 2X$. We state:

Theorem 3. Given arbitrary 2n-bit integers $N$ and $A, B < N$, the 2n-bit integer
$AB \bmod N$ can be computed with at most five n-bit MultModDiv instructions.

We denote our new algorithm by A3 and describe it in Fig. 5.

Input: radix base $X$, 2n-bit integers $A = A_1X + A_0$, $B = B_1X + B_0$
Output: $AB \pmod N$
Cost: 5 MultModDiv
  $(Q^{(1)}, R^{(1)}) = \mathrm{MultModDiv}(A_0, B_0, X)$
  $(Q^{(2)}, R^{(2)}) = \mathrm{MultModDiv}(A_1 + A_0, B_1 + B_0, X)$
  $(Q^{(3)}, R^{(3)}) = \mathrm{MultModDiv}(A_1, B_1, X)$
  $(Q^{(4)}, R^{(4)}) = \mathrm{MultModDiv}(\alpha, Q^{(3)}, X)$
  $(Q^{(5)}, R^{(5)}) = \mathrm{MultModDiv}(\alpha, -Q^{(1)} + Q^{(2)} - Q^{(3)} + Q^{(4)} + R^{(3)}, X)$
  Return $(R^{(1)} + R^{(5)}) + (Q^{(1)} - R^{(1)} + R^{(2)} - R^{(3)} + R^{(4)} + Q^{(5)})X$

Fig. 5. Double-size modular multiplication algorithm A3



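Proof (of Algorithm A3). Using $X^2 \equiv_N \alpha$ and

\[ AB = X(X-1)A_1B_1 + X(A_1 + A_0)(B_1 + B_0) - (X-1)A_0B_0 , \]

we get

\begin{align*}
(X-1)A_0B_0 &\equiv_N (X-1)(Q^{(1)}X + R^{(1)}) \equiv_N Q^{(1)}\alpha - R^{(1)} + (R^{(1)} - Q^{(1)})X , \\
X(A_1 + A_0)(B_1 + B_0) &\equiv_N X(Q^{(2)}X + R^{(2)}) \equiv_N Q^{(2)}\alpha + R^{(2)}X ,
\end{align*}

and

\begin{align*}
XA_1B_1 &\equiv_N X(Q^{(3)}X + R^{(3)}) \equiv_N Q^{(3)}\alpha + R^{(3)}X , \\
X^2A_1B_1 &\equiv_N X(R^{(3)}X + Q^{(3)}\alpha) \equiv_N R^{(3)}\alpha + Q^{(3)}\alpha X , \\
X(X-1)A_1B_1 &\equiv_N (-Q^{(3)} + R^{(3)})\alpha + (-R^{(3)} + Q^{(3)}\alpha)X ,
\end{align*}

wherefrom

\begin{align*}
AB &\equiv_N \alpha(-Q^{(1)} + Q^{(2)} - Q^{(3)} + R^{(3)}) + R^{(1)} + X(Q^{(3)}\alpha + Q^{(1)} - R^{(1)} + R^{(2)} - R^{(3)}) \\
 &\equiv_N \alpha(-Q^{(1)} + Q^{(2)} - Q^{(3)} + R^{(3)}) + R^{(1)} + X(Q^{(4)}X + R^{(4)} + Q^{(1)} - R^{(1)} + R^{(2)} - R^{(3)}) \\
 &\equiv_N \alpha(-Q^{(1)} + Q^{(2)} - Q^{(3)} + Q^{(4)} + R^{(3)}) + R^{(1)} + X(R^{(4)} + Q^{(1)} - R^{(1)} + R^{(2)} - R^{(3)}) \\
 &\equiv_N (R^{(1)} + R^{(5)}) + (Q^{(1)} - R^{(1)} + R^{(2)} + R^{(4)} - R^{(3)} + Q^{(5)})X ,
\end{align*}

which proves the correctness of A3. □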
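
A small numerical model makes the role of the modulus-dependent radix visible. The sketch below assumes the radix $X = \lceil\sqrt{N}\rceil$ and $\alpha = X^2 \bmod N$, models MultModDiv with `divmod`, and uses hypothetical names throughout; `math.isqrt` requires Python 3.8 or later. The output is an unreduced representative of $AB$ modulo $N$.

```python
import math
import random


def mult_mod_div(x, y, z):
    """Model of MultModDiv: (floor(x*y / z), x*y mod z)."""
    return divmod(x * y, z)


def a3(A, B, X, alpha):
    """Value congruent to A*B mod N, where X*X = alpha (mod N) (Algorithm A3)."""
    A1, A0 = divmod(A, X)
    B1, B0 = divmod(B, X)
    q1, r1 = mult_mod_div(A0, B0, X)
    q2, r2 = mult_mod_div(A1 + A0, B1 + B0, X)
    q3, r3 = mult_mod_div(A1, B1, X)
    q4, r4 = mult_mod_div(alpha, q3, X)
    q5, r5 = mult_mod_div(alpha, -q1 + q2 - q3 + q4 + r3, X)
    return (r1 + r5) + (q1 - r1 + r2 - r3 + r4 + q5) * X


random.seed(3)
N = random.randrange(1 << 31, 1 << 32)
X = math.isqrt(N - 1) + 1        # ceil(sqrt(N))
alpha = (X * X) % N
A, B = random.randrange(N), random.randrange(N)
print(alpha < 2 * X, a3(A, B, X, alpha) % N == (A * B) % N)
```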

Again, this algorithm uses only 5 MultModDiv instructions. But if we take a
careful look at its description, we observe that a couple of MultModDiv
instructions are performed directly with operand $\alpha$. Therefore, having a small value
for $\alpha$ would render these two MultModDiv instructions significantly faster. Suppose
for example that an n-bit $X$ can be found given $N$ such that $\alpha < 2^{n/2}$. Assuming
that the execution time of MultModDiv is essentially linear in the bitsize of
its first operand, Algorithm A3 would then have a time consumption close to 4
MultModDiv, resulting in an additional speedup of 20%.

4.2 Extreme Cases: Down to 3 MultModDiv 

Optimal performances are reached when $\alpha = -1, 2, 3$ for instance, in which cases
the computational cost of our algorithm reduces to 3 MultModDiv instructions.
One may of course ask under which circumstances there exists an n-bit integer
$X$ with such a trivial square modulo a 2n-bit RSA modulus $N$. A practical way
to ensure this consists in modifying the RSA key generation. We believe that
simple algebraic techniques make this possible while preserving the security of RSA
moduli.

Other choices for the representation base may also present interesting
properties, as we now illustrate. Assume for instance that for a given $N$, there exists
an n-bit $Y \ge \lceil \sqrt{N} \rceil$ such that

\[ Y^2 \equiv \alpha + \delta Y \pmod N , \]

where we try to make $\alpha$ and $\delta$ as trivial as possible. If $\alpha$ and $\delta$ are simple numbers
(ideally $\delta = 1$), A3 simplifies into Algorithm A4 depicted in Fig. 6.

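Input: radix base $Y$, 2n-bit integers $A = A_1Y + A_0$, $B = B_1Y + B_0$
Output: $AB \pmod N$
Cost: 3 MultModDiv
  $(Q^{(1)}, R^{(1)}) = \mathrm{MultModDiv}(A_0, B_0, Y)$
  $(Q^{(2)}, R^{(2)}) = \mathrm{MultModDiv}(A_1 + A_0, B_1 + B_0, Y)$
  $(Q^{(3)}, R^{(3)}) = \mathrm{MultModDiv}(A_1, B_1, Y)$
  Return $(-Q^{(1)} + Q^{(2)} - Q^{(3)} + R^{(3)} + Q^{(3)}\delta)\alpha + R^{(1)}$
    $+ (-R^{(1)} - R^{(3)} + Q^{(1)} + R^{(2)} + Q^{(3)}(\alpha + \delta^2) + (-Q^{(3)} + R^{(3)} - Q^{(1)} + Q^{(2)})\delta)Y$

Fig. 6. Double-size modular multiplication algorithm A4

Proof (of correctness for A4). Using $Y^2 \equiv_N \alpha + \delta Y$ and Lemma 1, one gets

\begin{align*}
(Y-1)A_0B_0 &\equiv_N (Y-1)(Q^{(1)}Y + R^{(1)}) \\
 &\equiv_N -Q^{(1)}Y - R^{(1)} + R^{(1)}Y + Q^{(1)}(\alpha + \delta Y) \\
 &\equiv_N Q^{(1)}\alpha - R^{(1)} + (R^{(1)} - Q^{(1)} + Q^{(1)}\delta)Y , \\
Y(A_1 + A_0)(B_1 + B_0) &\equiv_N Y(Q^{(2)}Y + R^{(2)}) \\
 &\equiv_N R^{(2)}Y + Q^{(2)}(\alpha + \delta Y) \\
 &\equiv_N Q^{(2)}\alpha + (R^{(2)} + Q^{(2)}\delta)Y , \\
YA_1B_1 &\equiv_N Y(Q^{(3)}Y + R^{(3)}) \equiv_N Q^{(3)}\alpha + (R^{(3)} + Q^{(3)}\delta)Y , \\
Y^2A_1B_1 &\equiv_N Y(Q^{(3)}\alpha + (R^{(3)} + Q^{(3)}\delta)Y) \\
 &\equiv_N Q^{(3)}\alpha Y + (R^{(3)} + Q^{(3)}\delta)(\alpha + \delta Y) \\
 &\equiv_N (R^{(3)} + Q^{(3)}\delta)\alpha + ((R^{(3)} + Q^{(3)}\delta)\delta + Q^{(3)}\alpha)Y , \\
Y(Y-1)A_1B_1 &\equiv_N (-Q^{(3)} + R^{(3)} + Q^{(3)}\delta)\alpha + ((R^{(3)} + Q^{(3)}\delta)(\delta - 1) + Q^{(3)}\alpha)Y \\
 &\equiv_N (-Q^{(3)} + R^{(3)} + Q^{(3)}\delta)\alpha + (-R^{(3)} + (-Q^{(3)} + R^{(3)})\delta + Q^{(3)}(\delta^2 + \alpha))Y ,
\end{align*}

so that

\begin{align*}
AB \equiv_N{} & \alpha(-Q^{(1)} + Q^{(2)} - Q^{(3)} + R^{(3)} + Q^{(3)}\delta) + R^{(1)} \\
 & + Y(-R^{(1)} - R^{(3)} + Q^{(1)} + R^{(2)} + Q^{(3)}(\alpha + \delta^2) + (-Q^{(3)} + R^{(3)} - Q^{(1)} + Q^{(2)})\delta) ,
\end{align*}

thereby proving Algorithm A4. □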
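
To experiment with A4 without a modified RSA key generator, one can work backwards: pick $Y$, $\alpha$ and $\delta$ first and set $N = Y^2 - \delta Y - \alpha$, so that $Y^2 \equiv \alpha + \delta Y \pmod N$ holds by construction. This construction, the names, and the parameter values below are assumptions for illustration only; such an $N$ is in general not a valid RSA modulus.

```python
import random


def mult_mod_div(x, y, z):
    """Model of MultModDiv: (floor(x*y / z), x*y mod z)."""
    return divmod(x * y, z)


def a4(A, B, Y, alpha, delta):
    """Value congruent to A*B mod N, where Y*Y = alpha + delta*Y (mod N)."""
    A1, A0 = divmod(A, Y)
    B1, B0 = divmod(B, Y)
    q1, r1 = mult_mod_div(A0, B0, Y)
    q2, r2 = mult_mod_div(A1 + A0, B1 + B0, Y)
    q3, r3 = mult_mod_div(A1, B1, Y)
    # Only multiplications by the small constants alpha and delta remain.
    low = (-q1 + q2 - q3 + r3 + q3 * delta) * alpha + r1
    high = (-r1 - r3 + q1 + r2 + q3 * (alpha + delta * delta)
            + (-q3 + r3 - q1 + q2) * delta)
    return low + high * Y


random.seed(4)
Y, alpha, delta = 54321, 1, 1
N = Y * Y - delta * Y - alpha     # toy modulus built so the relation holds
A, B = random.randrange(N), random.randrange(N)
print(a4(A, B, Y, alpha, delta) % N == (A * B) % N)
```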

Again, this algorithm has a cost of 3 MultModDiv instructions provided that
the values for $\alpha$, $\delta$ and $\alpha + \delta^2$ are simple constant numbers. This could be ensured
by properly adapting the RSA key generation algorithm.

5 Emulating Euclidean Multiplications 

When the Euclidean multiplication itself is not directly available in hardware, it
can be emulated easily with a couple of modular multiplications. The quotient
of MultModDiv in [4] is calculated from the remainders of $x \cdot y$ modulo $z$ and
modulo $(z + 1)$. However, such a situation is most unfortunate for fast modular
multiplication algorithms based on Montgomery's technique, as either $z$ or $z + 1$
is even. Although extensions of Montgomery to even moduli exist, we suggest a
simple alternative hereafter. Our method is based on the next lemma.




Faster Double-Size Modular Multiplication from Euclidean Multipliers 






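Lemma 2. If $0 \le xy \le (z-1)^2$ then

\[ \left\lfloor \frac{xy}{z + \beta} \right\rfloor \le \left\lfloor \frac{xy}{z} \right\rfloor \le \left\lfloor \frac{xy}{z + \beta} \right\rfloor + \beta \]

for any nonnegative $\beta$.

Proof. Since $z \le z + \beta$, it follows that $xy/(z + \beta) \le xy/z$ and consequently
$\lfloor xy/(z + \beta) \rfloor \le \lfloor xy/z \rfloor$. For the second inequality, we observe that

\[ \frac{xy}{z} = \left(1 + \frac{\beta}{z}\right)\frac{xy}{z + \beta} \le \frac{xy}{z + \beta} + \frac{(z-1)^2\beta}{(z + \beta)z} < \frac{xy}{z + \beta} + \beta . \]

Therefore, we get $\lfloor xy/z \rfloor \le \lfloor xy/(z + \beta) \rfloor + \lceil \beta \rceil = \lfloor xy/(z + \beta) \rfloor + \beta$. □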



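So, letting

\[ \Delta_\beta = \left\lfloor \frac{xy}{z + \beta} \right\rfloor - \left\lfloor \frac{xy}{z} \right\rfloor \quad \text{and} \quad C_\beta = xy \bmod (z + \beta) , \]

(with $0 \le -\Delta_\beta \le \beta$ by Lemma 2), one expresses the integer quotient resulting
from the modular multiplication, $C = xy \bmod z$, as

\[ \left\lfloor \frac{xy}{z} \right\rfloor = \frac{C - C_\beta - \Delta_\beta(z + \beta)}{\beta} . \tag{3} \]

Proof. By definition, we have $xy = \lfloor xy/z \rfloor z + C = \lfloor xy/(z + \beta) \rfloor (z + \beta) + C_\beta =
(\lfloor xy/z \rfloor + \Delta_\beta)(z + \beta) + C_\beta$, which implies $C = \lfloor xy/z \rfloor \beta + \Delta_\beta(z + \beta) + C_\beta$. □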


In particular, the value $\beta = 2$ yields the integer quotient from two modular
reductions with moduli having the same parity as $z$. Carrying out a division by
$\beta$ is inexpensive as it amounts to a shift of a single bit to the right. Finally, since
$|\Delta_2| \le 2$, there are (at most) only two negligible corrections to make to get the
exact value of the quotient.
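
The case $\beta = 2$ can be sketched as follows. This is an illustrative model under assumptions not stated in the text: $z$ is taken odd (the situation of interest for Montgomery-based multipliers), $0 \le x, y < z$, the two modular multiplications are simulated with the `%` operator, and the correction step selects $\Delta_2 \in \{0, -1, -2\}$ using the parity of the numerator and the bound $\lfloor xy/z \rfloor \le z - 2$. The function name is hypothetical.

```python
def emulated_mult_mod_div(x, y, z):
    """Emulate MultModDiv from products modulo z and z + 2 (z odd, 0 <= x, y < z)."""
    c = (x * y) % z            # modular multiplication modulo z
    c2 = (x * y) % (z + 2)     # modular multiplication modulo z + 2
    for delta in (0, -1, -2):  # possible values of Delta_2 by Lemma 2
        num = c - c2 - delta * (z + 2)
        # Equation (3) with beta = 2: the quotient is num / 2.
        if num >= 0 and num % 2 == 0 and num // 2 <= z - 2:
            return num // 2, c
    raise ValueError("inputs outside the assumed range")


print(emulated_mult_mod_div(5, 6, 7), divmod(5 * 6, 7))
```

With $z$ odd, a wrong candidate either has an odd numerator or differs from the true quotient by $z + 2$, which pushes it out of the range $[0, z - 2]$, so exactly one candidate is accepted.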



Remark 1. This method readily extends to any value of $\beta$; the powers of 2 are
of particular interest. Note also that a way to lower the expected error (cf. $\Delta_\beta$)
consists in increasing the numerator in $\lfloor xy/(z + \beta) \rfloor$.



6 A Modular Multiplication Based Algorithm 

When Euclidean multiplications are emulated in software from modular
multiplications, one may wonder if using these directly could yield faster algorithms
without necessarily coming back to RNS-based approaches [1, 2, 7, 9]. In this
section, we propose a simple alternative to these works that keeps numbers under
a radix representation. We rely on the following lemma.

Lemma 3. Let $X$ be an n-bit odd integer not divisible by 3 and $N$ an arbitrary
integer such that $X \ge \lceil \sqrt{N} \rceil$. There exists an algorithm which, given any
$A = A_1X + A_0$ and $B = B_1X + B_0$ such that $A, B < N$, outputs the representation

\[ AB = C_3X^3 + C_2X^2 + C_1X + C_0 \]

in at most four n-bit modular multiplications.

We refer the reader to Appendix A for a description of such an algorithm,
which we denote by Coefficients in the sequel.

Now, very much in the spirit of Section 4.1, we precompute $X$ such that
$X \ge \lceil \sqrt{N} \rceil$ and set $\alpha = X^2 \bmod N$. Here however, as we need $\gcd(X, 6) = 1$,
we try out $X = \lceil \sqrt{kN} \rceil$ for increasing values of $k = 1, \ldots$, until $X$ is found
odd and coprime to 3. Even if $|X|$ exceeds $n$, the difference $|X| - n$ will be a
very small value in any case, and we refer to the fact that we are able to work
with non-reduced numbers when they do not exceed their range too much (see
Section 2). Relying on Lemma 3, we devise Algorithm A5 as shown in Fig. 7.



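Input: radix base $X$, 2n-bit integers $A = A_1X + A_0$, $B = B_1X + B_0$
Output: $AB \pmod N$
Cost: 2 MultModDiv + 1 Coefficients
  $(U^{(1)}, V^{(1)}, W^{(1)}, R^{(1)}) = \mathrm{Coefficients}(A, B, X)$
  $(Q^{(2)}, R^{(2)}) = \mathrm{MultModDiv}(\alpha, U^{(1)}, X)$
  $(Q^{(3)}, R^{(3)}) = \mathrm{MultModDiv}(\alpha, V^{(1)} + Q^{(2)}, X)$
  Return $R^{(1)} + R^{(3)} + (R^{(2)} + W^{(1)} + Q^{(3)})X$

Fig. 7. Double-size modular multiplication algorithm A5

Proof (of correctness for A5). By definition,

\begin{align*}
AB &= U^{(1)}X^3 + V^{(1)}X^2 + W^{(1)}X + R^{(1)} \\
 &\equiv_N R^{(1)} + V^{(1)}\alpha + (U^{(1)}\alpha + W^{(1)})X \\
 &\equiv_N R^{(1)} + V^{(1)}\alpha + (Q^{(2)}X + R^{(2)} + W^{(1)})X \\
 &\equiv_N R^{(1)} + (V^{(1)} + Q^{(2)})\alpha + (R^{(2)} + W^{(1)})X \\
 &\equiv_N R^{(1)} + R^{(3)} + (R^{(2)} + W^{(1)} + Q^{(3)})X ,
\end{align*}

thereby validating A5. □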
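
The recombination performed by A5 can be checked with a stand-in for Coefficients. In the sketch below, `coefficients` simply expands the full product $AB$ in base $X$ using big-integer arithmetic; this is an assumption made for illustration and is not the four-multiplication procedure of Appendix A. The other names are likewise hypothetical, and the radix is taken as $X = \lceil\sqrt{N}\rceil$ without the extra condition $\gcd(X, 6) = 1$, which only the real Coefficients routine needs.

```python
import math
import random


def mult_mod_div(x, y, z):
    """Model of MultModDiv: (floor(x*y / z), x*y mod z)."""
    return divmod(x * y, z)


def coefficients(A, B, X):
    """Stand-in for Coefficients: base-X digits (U, V, W, R) of A*B."""
    p, r = divmod(A * B, X)
    p, w = divmod(p, X)
    u, v = divmod(p, X)
    return u, v, w, r


def a5(A, B, X, alpha):
    """Value congruent to A*B mod N, where X*X = alpha (mod N) (Algorithm A5)."""
    u, v, w, r1 = coefficients(A, B, X)
    q2, r2 = mult_mod_div(alpha, u, X)
    q3, r3 = mult_mod_div(alpha, v + q2, X)
    return r1 + r3 + (r2 + w + q3) * X


random.seed(5)
N = random.randrange(1 << 31, 1 << 32)
X = math.isqrt(N - 1) + 1        # ceil(sqrt(N))
alpha = (X * X) % N
A, B = random.randrange(N), random.randrange(N)
print(a5(A, B, X, alpha) % N == (A * B) % N)
```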

As indicated, our algorithm runs two MultModDiv and one Coefficients
operations, which (relying on Section 5 or [4]) yields 8 modular multiplications
among which 4 are executed with operand $\alpha$. Then, we can combine A5 with a
proper modification of the RSA key generator to ensure that $\alpha$ is some small
(absolute) constant. In this context of use, the cost of a double-size modular
multiplication by A5 reduces to four n-bit multiplications only, i.e., becomes
computationally equivalent to 2 calls to MultModDiv, thereby yielding a speedup
factor of $(14 - 4)/14 \approx 71\%$ in comparison with FS1.
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7 Conclusion 

In this paper, we showed how to optimally reduce the cost of Fischer and Seifert 
double-size modular multiplications, provided that the same basic operation (Eu- 
clidean multiplication) is available. We highlighted the role of the data represen- 
tation towards the performance of emulated multiplications and proposed new 
ones featuring dramatic cost savings. 



Table 1. Number of calls in double-size modular multiplication algorithms. The last 
line displays the number of equivalent n-bit modular multiplications 



                   Fischer-Seifert   Our algorithms
Calls              FS1    FS2        A1    A2    A3        A4    A5
MultModDiv         7      5          6     4     5 → 3     3     2 → 0
MultModDivInit     0      1          0     1     0         0     0
Coefficients       0      0          0     0     0         0     1
Equiv. MultMod     14     12         12    10    10 → 6    6     8 → 4

We stress that in each and every of our algorithms, modifications of the 
radix base can be carried out while maintaining inputs and outputs under the 
same arithmetic format, which allows repeated executions with the same modu- 
lus. Naturally, the same algorithms may readily be used to perform double-size 
modular squarings. In the most favorable case, we emulate a double-size modu- 
lar multiplication with no more than 3 Euclidean multiplications, resulting in a 
speedup factor of 57% in comparison with Fischer and Seifert’s original proce- 
dures, as indicated in Table 1. When Euclidean multiplications cannot be carried 
out in hardware, we provide a variation based on modular multiplications only 
which surpasses original algorithms in performance by 71%. Although we doubt 
the existence of more advanced yet simple techniques, we challenge the crypto- 
graphic community for better results. 
A Proof of Lemma 3

Our four modular multiplications will be

\[
\begin{aligned}
R_0 &= (A \bmod X)(B \bmod X) \bmod X, \\
R_1 &= (A \bmod (X+1))(B \bmod (X+1)) \bmod (X+1), \\
R_2 &= (A \bmod (X+2))(B \bmod (X+2)) \bmod (X+2), \\
R_3 &= (A \bmod (2X+3))(B \bmod (2X+3)) \bmod (2X+3).
\end{aligned}
\]

In what follows, we use the notations

\[
k_0 = AB \,\mathrm{div}\, X, \quad k_1 = AB \,\mathrm{div}\, (X+1), \quad
k_2 = AB \,\mathrm{div}\, (X+2), \quad k_3 = AB \,\mathrm{div}\, (2X+3),
\]

and we have by definition

\[
\begin{aligned}
AB &= k_0 X + R_0 = k_1(X+1) + R_1 = k_0(X+1) + (R_0 - k_0) \\
   &= k_2(X+2) + R_2 = k_0(X+2) + (R_0 - 2k_0), \\
2AB &= 2k_0 X + 2R_0 = 2k_3(2X+3) + 2R_3 = k_0(2X+3) - 3k_0 + 2R_0,
\end{aligned}
\]

so that

\[
\begin{aligned}
k_0 &\equiv R_0 - R_1 \pmod{X+1}, \\
2k_0 &\equiv R_0 - R_2 \pmod{X+2}, \\
3k_0 &\equiv 2(R_0 - R_3) \pmod{2X+3}.
\end{aligned}
\]

Since $X$ is coprime to 3, if we call
$a = (R_0 - R_2 + ((R_0 - R_2) \bmod 2)(X+2))/2 \bmod (X+2)$,
$b = R_0 - R_1 \bmod (X+1)$ and
$c = (2(R_0 - R_3) + (2(R_0 - R_3) \bmod 3)(2X+3))/3 \bmod (2X+3)$,
we get that $k_0 \equiv b \pmod{X+1}$, $k_0 \equiv a \pmod{X+2}$ and
$k_0 \equiv c \pmod{2X+3}$. Starting from these equations, we can perform
Chinese remaindering, because $X+1$, $X+2$ and $2X+3$ are pairwise
relatively prime:

\[ k_0 \bmod ((X+1)(X+2)) = ((b - a) \bmod (X+1))(X+2) + a. \]

Letting $d = (b - a) \bmod (X+1)$ and $e = a + 2d$, we have

\[ k_0 \bmod ((X+1)(X+2)) = dX + e. \]

Moreover, remarking that $(X+1)(X+2)(-4) \equiv 1 \pmod{2X+3}$ and letting
$f = -6d + 4e - 4c \bmod (2X+3)$, we notice that the second CRT recombination

\[
\begin{aligned}
k_0 &= [-4(c - dX - e) \bmod (2X+3)](X+1)(X+2) + dX + e \\
    &= [(2d(2X+3) - 6d + 4e - 4c) \bmod (2X+3)](X+1)(X+2) + dX + e \\
    &= f(X+1)(X+2) + dX + e
\end{aligned}
\]

is easily rewritten as $k_0 = fX^2 + (d + 3f)X + (e + 2f)$. Consequently, $C = AB$ is
computed in 4 modular multiplications as $C = C_3X^3 + C_2X^2 + C_1X + C_0$ with

\[
C_3 = f, \qquad C_2 = d + 3f, \qquad C_1 = e + 2f, \qquad C_0 = R_0.
\]

Note that these operations are not of size $2n \times 2n$ modulo $n$ but of size $n \times n$
modulo $n$, because, from $A = A_1X + A_0$ and $B = B_1X + B_0$, the values $R_0$, $R_1$, $R_2$ and
$R_3$ can be computed as

\[
\begin{aligned}
R_0 &= A_0B_0 \bmod X, \\
R_1 &= (A_0 - A_1)(B_0 - B_1) \bmod (X+1), \\
R_2 &= (A_0 - 2A_1)(B_0 - 2B_1) \bmod (X+2), \\
R_3 &= (A_0 + (A_1 \bmod 2)X - 3(A_1 \,\mathrm{div}\, 2)) \\
    &\quad \times (B_0 + (B_1 \bmod 2)X - 3(B_1 \,\mathrm{div}\, 2)) \bmod (2X+3).
\end{aligned}
\]

As before, the cost of auxiliary operations (additions, subtractions, parity bits,
etc.) is neglected. □
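The recombination in this proof can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `double_size_product` is hypothetical, $X$ is assumed odd and not divisible by 3 (so that 2 and 3 are invertible modulo $X+2$ and $2X+3$), and the divisions by 2 and 3 are carried out with Python's built-in modular inverse `pow(x, -1, m)` (Python 3.8+) rather than with the parity-correction formulas for $a$ and $c$ used above.

```python
def double_size_product(A, B, X):
    """Return A*B for 0 <= A, B < X**2 using four n-bit-style modular products.

    Assumption (not from the paper): X is odd and not divisible by 3, so the
    modular inverses of 2 and 3 below exist.
    """
    A1, A0 = divmod(A, X)
    B1, B0 = divmod(B, X)
    m1, m2, m3 = X + 1, X + 2, 2 * X + 3

    # The four modular multiplications R0..R3 from the proof.
    R0 = (A0 * B0) % X
    R1 = ((A0 - A1) * (B0 - B1)) % m1
    R2 = ((A0 - 2 * A1) * (B0 - 2 * B1)) % m2
    a3 = A0 + (A1 % 2) * X - 3 * (A1 // 2)
    b3 = B0 + (B1 % 2) * X - 3 * (B1 // 2)
    R3 = (a3 * b3) % m3

    # Residues of k0 = AB div X modulo X+1, X+2 and 2X+3.
    b = (R0 - R1) % m1
    a = ((R0 - R2) * pow(2, -1, m2)) % m2
    c = (2 * (R0 - R3) * pow(3, -1, m3)) % m3

    # Two CRT recombination steps.
    d = (b - a) % m1
    e = a + 2 * d
    f = (-6 * d + 4 * e - 4 * c) % m3

    # C = C3*X^3 + C2*X^2 + C1*X + C0
    return f * X**3 + (d + 3 * f) * X**2 + (e + 2 * f) * X + R0
```

For instance, with $X = 7$, $A = 10$ and $B = 15$ the intermediate values are $R_0 = 3$, $R_1 = 6$, $R_2 = 6$, $R_3 = 14$, $d = 2$, $e = 7$, $f = 0$, and the function returns $150 = AB$.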
Abstract. We present a fast and compact hardware architecture of
exponentiation in a finite field $GF(2^n)$ determined by a Gauss period of
type $(n, k)$ with $k \ge 2$. Our construction is based on the ideas of Gao
et al. and on the computational evidence that a Gauss period of type
$(n, k)$ over $GF(2)$ is very often primitive when $k \ge 2$. Also in the case
of a Gauss period of type $(n, 1)$, i.e. a type I optimal normal element,
we find a primitive element in $GF(2^n)$ which is a sparse polynomial of
a type I optimal normal element and we propose a fast exponentiation
algorithm which is applicable for both software and hardware purposes.
We give an explicit hardware design using the algorithm.

Keywords: Finite field, Gauss period, primitive element, exponentia- 
tion, optimal normal basis 



1 Introduction 

Arithmetic of finite fields finds various applications in many cryptographic areas
these days. In particular, fast exponentiation is very important in such applications
as Diffie-Hellman key exchange and pseudo random bit generators. Though
exponentiation is the most time consuming and complex arithmetic operation, in
some situations such as Diffie-Hellman key exchange, one can devise an efficient
exponentiation algorithm since a fixed (primitive) element is raised to many
different powers. Let $GF(q^n)$ be a finite field with $q^n$ elements where $q$ is a power of
a prime and let $g \in GF(q^n)$ be a primitive element (or an element of high
multiplicative order). Roughly speaking, the computation of $g^s$ for arbitrary values
of $s$ is studied from two different directions. One is the use of precomputation
with vector addition chains such as the BGMW method [1] and its improvements
by Lim and Lee [6] and also by Rooij [7]. The other approach is suggested by
Gao et al. [4,5] and it uses a special primitive element called a Gauss period
which generates a normal basis for $GF(q^n)$ over $GF(q)$. The BGMW method
and its improvements are applicable to an arbitrary finite field $GF(q^n)$ and are very
flexible. On the other hand, an ideal version of the BGMW method requires a
memory of order $O(n \log q / \log(n \log q))$ values in $GF(q^n)$ and multiplications of order
$O(\log(n \log q))$, which amounts to an order of $O(n^2 \log^2 q \log(n \log q))$ bit
additions. The algorithm proposed by Gao et al. is not applicable to all finite fields.
However, it does not need a precomputation and the complexity of the
algorithm is $O(kqn^2)$ additions. Therefore if $q$ is small and if there is a Gauss period
of high order of type $(n, k)$ for a small value of $k$, then the method of Gao et
al. outperforms the precomputation methods. In this paper, we will discuss an
improved algorithm of Gao et al. for a hardware arrangement. We will present a
compact and fast hardware architecture for exponentiation using Gauss periods
of type $(n, k)$ in $GF(2^n)$ where $k \ge 2$, and detailed explanations will be given for
$k = 2, 3$. Also we will give an algorithm for efficient exponentiation in the field
determined by an irreducible all one polynomial (AOP). This is possible since
we may successfully find a primitive element which is a trinomial of a root of an
AOP for most of the cases. Since none of the papers [1,4,5,6,7] mentions an
explicit hardware architecture for exponentiation and since our construction of the
circuit has the features of regularity and modularity for VLSI implementation,
our result may have possible applications such as smart card purposes.

2 Gauss Periods of Type $(n,k)$ in $GF(q^n)$

We will briefly review the theory of Gauss periods and the method of Gao et al.
Let $n, k$ be positive integers such that $p = nk + 1$ is a prime not dividing $q$. Let
$K = \langle\tau\rangle$ be the unique subgroup of order $k$ in $GF(p)^\times$. Let $\mathrm{ord}_p q$ be the order of
$q$ modulo $p$ and assume $\gcd(nk/\mathrm{ord}_p q, n) = 1$. Let $\beta$ be a primitive $p$th root of
unity in $GF(q^{nk})$. Then the following element

\[ \alpha = \sum_{j=0}^{k-1} \beta^{\tau^j} \tag{1} \]

is called a Gauss period of type $(n,k)$ over $GF(q)$. It is well known that $\alpha$ is
a normal element in $GF(q^n)$. That is, letting $\alpha_i = \alpha^{q^i}$ for $0 \le i \le n-1$,
$\{\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_{n-1}\}$ is a basis for $GF(q^n)$ over $GF(q)$. Since $K = \langle\tau\rangle$ is a
subgroup of order $k$ in $GF(p)^\times$, a cyclic group of order $p - 1 = nk$, the quotient
group $GF(p)^\times/K$ is also a cyclic group of order $n$ and the generator of the group
is $qK$. Therefore we have a coset decomposition of $GF(p)^\times$ as a disjoint union,

\[ GF(p)^\times = K_0 \cup K_1 \cup K_2 \cup \cdots \cup K_{n-1}, \tag{2} \]

where $K_i = q^iK$, $0 \le i \le n-1$. Note that any element in $GF(p)^\times$ is uniquely
written as $\tau^s q^t$ for some $0 \le s \le k-1$ and $0 \le t \le n-1$. Now for each
$0 \le i \le n-1$, we have

\[
\alpha\alpha_i = \sum_{s=0}^{k-1} \beta^{\tau^s} \sum_{t=0}^{k-1} \beta^{\tau^{s+t} q^i}
= \sum_{s=0}^{k-1} \sum_{t=0}^{k-1} \beta^{\tau^s(1 + \tau^t q^i)}. \tag{3}
\]
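Notice that there is a unique pair $0 \le u \le k-1$ and $0 \le v \le n-1$ such that
$1 + \tau^u q^v = 0 \in GF(p)$. If $t \ne u$ or $i \ne v$, then we have $1 + \tau^t q^i \in K_{\sigma(t,i)}$ for some
$0 \le \sigma(t,i) \le n-1$ depending on $t$ and $i$. Thus we may write $1 + \tau^t q^i = \tau^{t'} q^{\sigma(t,i)}$
for some $t'$. Now when $i \ne v$,

\[
\begin{aligned}
\alpha\alpha_i &= \sum_{s=0}^{k-1} \sum_{t=0}^{k-1} \beta^{\tau^s(1 + \tau^t q^i)}
= \sum_{s=0}^{k-1} \sum_{t=0}^{k-1} \beta^{\tau^s \tau^{t'} q^{\sigma(t,i)}} \\
&= \sum_{t=0}^{k-1} \sum_{s=0}^{k-1} \beta^{\tau^{s+t'} q^{\sigma(t,i)}}
= \sum_{t=0}^{k-1} \alpha^{q^{\sigma(t,i)}} = \sum_{t=0}^{k-1} \alpha_{\sigma(t,i)}.
\end{aligned} \tag{4}
\]

Also when $i = v$,

\[
\begin{aligned}
\alpha\alpha_v &= \sum_{s=0}^{k-1} \sum_{t=0}^{k-1} \beta^{\tau^s(1 + \tau^t q^v)}
= \sum_{t \ne u} \sum_{s=0}^{k-1} \beta^{\tau^s(1 + \tau^t q^v)} + \sum_{s=0}^{k-1} 1 \\
&= \sum_{t \ne u} \sum_{s=0}^{k-1} \beta^{\tau^{s+t'} q^{\sigma(t,v)}} + k
= \sum_{t \ne u} \alpha_{\sigma(t,v)} + k.
\end{aligned} \tag{5}
\]

Therefore $\alpha\alpha_i$ is computed by the sum of at most $k$ basis elements in $\{\alpha_0, \alpha_1, \ldots,
\alpha_{n-1}\}$ for $i \ne v$ and $\alpha\alpha_v$ is computed by the sum of at most $k - 1$ basis elements
and the constant term $k \in GF(q)$. Using these ideas, Gao et al. [4] showed the
following.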

Theorem 1. Let $\alpha$ be a Gauss period of type $(n,k)$ over $GF(q)$, with $k$ and $q$
bounded. For any $0 \le r < q^n$, $\alpha^r$ can be computed in $O(n^2)$ additions in $GF(q)$.

Sketch of Proof. Write $r = \sum_{j=0}^{n-1} r_j q^j$ with $0 \le r_j < q$. Then the following
algorithm gives an output $\alpha^r$.

Table 1. An exponentiation algorithm in [4]

Input: $r = \sum_{j=0}^{n-1} r_j q^j$
Output: $\alpha^r = \prod_{0 \le i \le n-1} \alpha_i^{r_i}$
$A \leftarrow 1$
for ($i = 0$ to $n - 1$; $i{+}{+}$)
    if $r_i \ne 0$
        for ($j = 1$ to $r_i$; $j{+}{+}$)
            $A \leftarrow A\alpha_i$
        end for
    end if
end for

Assuming that the $q$th Frobenius map $\alpha \to \alpha^q$ is almost free, $A\alpha_i$ is computed
by $O(nk)$ additions in $GF(q)$ in a redundant basis $\{\alpha_0, \alpha_1, \ldots\}$. For
each $i$, the inner loop $A \leftarrow A\alpha_i$ runs $r_i$ times. Therefore the total number of
multiplications $A \leftarrow A\alpha_i$ is $\sum_{i=0}^{n-1} r_i \le n(q-1)$. Since $A\alpha_i$ is computed by $O(nk)$
additions, one can compute $\alpha^r$ by $O(kqn^2)$ additions in $GF(q)$. □

If the above theorem is to have any application, it must be guaranteed that the
Gauss period $\alpha$ is a primitive element in $GF(q^n)$, or at least is of high order.
This is not always satisfied. For example, a Gauss period $\alpha$ of type $(n, 1)$ is
never a primitive element since $\alpha^{n+1} = 1$ and $n + 1 \ll q^n$. However, various
computational results imply that the Gauss period $\alpha$ of type $(n, k)$, $k \ge 2$, over
$GF(2)$ is very often primitive, and even in the cases where $\alpha$ is not primitive, it
usually has a very high multiplicative order. For example, it is known [4] that,
among the 177 values of $n < 1000$ for which a Gauss period $\alpha$ of type $(n, 2)$ over
$GF(2)$ exists, $\alpha$ is a primitive element for 146 values of $n$. The same table in [4]
implies that a Gauss period of type $(n, k)$ over $GF(2)$ is also very often primitive
for $k \ge 3$. In the table, it is shown that for approximately 1050 values of
$2 < n < 1200$, there is a primitive Gauss period of type $(n, k)$ for some $k$, and in
many cases, one can choose $k < 20$. A theorem supporting this experimental evidence
is obtained by Gathen and Shparlinski [17], where it is shown that a Gauss period of type
$(n, 2)$ in $GF(q^n)$ has order at least $2^{\sqrt{2n}-2}$ for infinitely many $n$.

3 Hardware Arrangements for Exponentiation Using
Gauss Periods of Type $(n,k)$ in $GF(2^n)$ for $k \ge 2$

Throughout this section, let us assume that $q = 2$. The algorithm in section 2 is
not suitable for a hardware arrangement since one has to multiply by a different $\alpha_i$
at each step and since the exact number of additions in the coefficients of the
expression $A\alpha_i$ is unclear. From now on, instead of using a redundant basis as
in section 2, we will always use a normal basis $\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}$ because our
approach is more suitable for a unified and simple hardware architecture. Let
$A = \sum_{i=0}^{n-1} a_i\alpha_i$ with $a_i \in GF(2)$. Notice that there exist unique $0 \le u \le k-1$
and $0 \le v \le n-1$ such that $1 + \tau^u 2^v \equiv 0 \pmod p$. In this case there is no
$0 \le \sigma(u,v) \le n-1$ satisfying $0 = 1 + \tau^u 2^v \in K_{\sigma(u,v)}$. Therefore from the
equations (4) and (5),

\[
A\alpha = \sum_{i=0}^{n-1} a_i\alpha_i\alpha
= a_v k + \sum_{i=0}^{n-1} \sum_{\substack{t=0 \\ (t,i) \ne (u,v)}}^{k-1} a_i\alpha_{\sigma(t,i)}. \tag{6}
\]

For each $0 \le i \le n-1$, letting $t_{ij} = |\{0 \le t \le k-1 \mid 1 + \tau^t 2^i \in K_j\}| =
|\{0 \le t \le k-1 \mid \sigma(t,i) = j\}|$, we have

\[
A\alpha = a_v k + \sum_{i=0}^{n-1} a_i \sum_{j=0}^{n-1} t_{ij}\alpha_j
= a_v k + \sum_{j=0}^{n-1} \left( \sum_{i=0}^{n-1} a_i t_{ij} \right) \alpha_j. \tag{7}
\]
Lemma 1. If $k$ is even, each coefficient of $A\alpha$ is computed by the sum of at
most $k$ $a_i$'s, $0 \le i \le n-1$, and if $k$ is odd, it is computed by the sum of at most
$k + 1$ $a_i$'s.

Proof. For each $j$, it is almost clear that the number of $0 \le i \le n-1$ such that
$t_{ij} \ne 0$ is at most $k$. If not, there are $k + 1$ different $i_s$ with $s = 1, 2, \ldots, k+1$
such that $1 + \tau^{t_s} 2^{i_s} \in K_j$ for some $t_s$, $s = 1, 2, \ldots, k+1$. Since the coset
$K_j = 2^jK$ is a set with $k$ elements, there exist $1 \le s \ne l \le k+1$ such that
$1 + \tau^{t_s} 2^{i_s} = 1 + \tau^{t_l} 2^{i_l}$, which implies $i_s = i_l$. Therefore $\sum_{i=0}^{n-1} a_i t_{ij}$ is the sum of
at most $k$ $a_i$'s. Because we have the field $GF(2^n)$ of characteristic two, $a_v k = 0$ if
$k$ is even and $a_v k = a_v$ if $k$ is odd. Since the coefficient of $\alpha_j$ in $A\alpha$ is $\sum_{i=0}^{n-1} a_i t_{ij}$
if $k$ is even and $a_v + \sum_{i=0}^{n-1} a_i t_{ij}$ if $k$ is odd, our assertion is verified. □

Now we are ready to give a modified algorithm which is easily applicable to a
hardware arrangement.

Table 2. A modified exponentiation algorithm for a hardware purpose

Input: $r = \sum_{j=0}^{n-1} r_j 2^j$
Output: $\alpha^r$
$A \leftarrow 1$
for ($i = n - 1$ to $0$; $i{-}{-}$)
    $A \leftarrow A^2\alpha^{r_i}$
end for

The above algorithm is just a simple form of the binary window method which computes

\[ \alpha^r = (\cdots((\alpha^{r_{n-1}})^2 \alpha^{r_{n-2}})^2 \cdots)^2 \alpha^{r_0}. \]

Notice that by lemma 1, the operation $A \leftarrow A^2\alpha^{r_i}$ in our algorithm needs at most
$k - 1$ or $k$ additions under the normal basis expression. One may realize the above
exponentiation in a linear array circuit consisting of $n$ flip-flops, $n$ 2-1 MUXs
and at most $n(k-1)$ or $nk$ (depending on the parity of $k$) XOR gates. The initial
value $A = 1$ is loaded in $n$ flip-flops, i.e. we have $a_0 = a_1 = \cdots = a_{n-1} = 1$
initially. The signal of $r = \sum_{j=0}^{n-1} r_j 2^j$ is loaded serially in descending order. That
is, $r_0, \ldots, r_{n-2}, r_{n-1} \to$. Since $A \to A^2$ is free in a hardware arrangement (just
a rewiring), $A \leftarrow A^2\alpha^{r_i}$ is computed with at most $k - 1$ or $k$ additions for each
coefficient. This operation can be done in one clock cycle. Namely, at the $i$th clock
cycle, all the coefficients of $A^2$ and $A^2\alpha$ are loaded as input values of the MUXs
where the control signal is $r_{n-i}$. Therefore if $r_{n-i} = 0$, then $A^2$ is selected, and
if $r_{n-i} = 1$, then $A^2\alpha$ is selected. Recall that XOR is a 2-input XOR
gate and MUX is a 2-1 multiplexer. Also $D_X$ is the delay time of an XOR and
$D_M$ is the delay time of a MUX.
Proposition 1. Let $\alpha$ be a Gauss period of type $(n,k)$, $k \ge 2$, in $GF(2^n)$. Let
$r = \sum_{i=0}^{n-1} r_i 2^i$ with $r_i = 0, 1$. Then,

(a) we can construct a linear array which computes $\alpha^r$ using $n$ flip-flops, $n$ 2-1
MUXs, and at most $n(k-1)$ XOR gates if $k$ is even, at most $nk$ XOR gates if
$k$ is odd.

(b) Each coefficient of $\alpha_j$ consists of an XOR tree with at most $k - 1$ or $k$ XOR
gates. Thus the depth of each XOR tree is at most $\lceil \log_2 k \rceil$ and the critical path
delay of our architecture is $\lceil \log_2 k \rceil D_X + D_M$.

We present the design of the circuit in Fig. 1. Note that we get the result $\alpha^r$
after $n$ clock cycles and at each $i$th clock cycle, $A^2$ and $A^2\alpha$ are simultaneously
computed and pass through the MUX to get the correct value $A \leftarrow A^2\alpha^{r_{n-i}}$.

Fig. 1. A circuit for exponentiation using a type $(n, k)$ Gauss period in $GF(2^n)$



To show the power of our architecture, which is a linear array but involves many
parallel computations, let us think of a finite field $GF(2^{1188})$. It is known [4] that
the lowest complexity primitive Gauss period in $GF(2^{1188})$ is of type (1188, 19),
i.e. $k = 19$. In this case, our architecture needs 1188 flip-flops and MUXs, and at
most 21384 XOR gates. But the critical path delay is only $\lceil \log_2 k \rceil D_X + D_M =
5D_X + D_M$. When $n = 1194$, we have a primitive Gauss period of type (1194, 2)
in $GF(2^{1194})$. Thus we need only 1194 XOR gates and the critical path delay
is $D_X + D_M$. It should be mentioned that a linear array for exponentiation is
proposed by Wu and Hasan [15] using a polynomial basis. Though their method
is quite efficient, the complexity and the structure of the design heavily depend
on the choice of primitive irreducible polynomial. By contrast, our array provides
high flexibility and modularity with respect to the field size $n$. In the following
subsections, we will discuss the circuits of Gauss periods of type $(n, 2)$ and $(n, 3)$
which have low computational complexity. In these cases, the exact number of
necessary gates will be determined rather easily.



3.1 Optimal Normal Basis of Type II over $GF(2)$

Let $\alpha = \beta + \beta^{-1}$ be a Gauss period of type $(n, 2)$ in $GF(2^n)$, where $2n + 1 = p$
is a prime and $\beta$ is a primitive $p$th root of unity in $GF(2^{2n})$. It is also called an
optimal normal element of type II and $\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}$ is called an optimal
normal basis of type II over $GF(2)$. It has the lowest complexity in the sense that
the sum of the number of nonzero terms in the expression of $\alpha\alpha_i$ for $0 \le i \le n-1$
is minimal, which is $2n - 1$. Since $\{1, 2, \ldots, 2n\}$ and $\{\pm 1, \pm 2, \ldots, \pm n\}$ are the
same reduced residue system (mod $p$), we easily find that $\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}$ and
$\{\beta + \beta^{-1}, \beta^2 + \beta^{-2}, \ldots, \beta^n + \beta^{-n}\}$ are the same sets. Letting
$\alpha'_s = \beta^s + \beta^{-s}$ for $1 \le s \le n$, it is clear that
$\alpha\alpha'_s = (\beta + \beta^{-1})(\beta^s + \beta^{-s}) = \alpha'_{s-1} + \alpha'_{s+1}$. A multiplication
table can be constructed easily using the above property. Or we may use the self
dual property of a Gauss period of type $(n, k)$ for even $k$. We say that two
bases $\{\beta_1, \beta_2, \ldots, \beta_n\}$ and $\{\gamma_1, \gamma_2, \ldots, \gamma_n\}$ of $GF(2^n)$ are dual if the trace map,
$\mathrm{Tr} : GF(2^n) \to GF(2)$, with $\mathrm{Tr}(\beta) = \beta + \beta^2 + \cdots + \beta^{2^{n-1}}$, satisfies $\mathrm{Tr}(\beta_i\gamma_j) = \delta_{ij}$
for all $1 \le i, j \le n$, where $\delta_{ij} = 1$ if $i = j$, zero if $i \ne j$. A basis $\{\beta_1, \beta_2, \ldots,
\beta_n\}$ is said to be self dual if $\mathrm{Tr}(\beta_i\beta_j) = \delta_{ij}$. One can directly prove that the
Gauss period of type $(n, 2)$ in $GF(2^n)$ generates a self dual normal basis or, more
generally, one may refer to the result in [3] which says that a normal basis of a Gauss
period of type $(n, k)$ in $GF(2^n)$ is self dual if and only if $k$ is even. Using this
self duality, or by a straightforward computation, one can show [20] that

Lemma 2. Let $\beta$ be a primitive $p$th root of unity in $GF(2^{2n})$ where $p = 2n + 1$ is
a prime, and let $\alpha = \beta + \beta^{-1}$ be an optimal normal element of type II in $GF(2^n)$.
Let $\alpha'_i = \beta^i + \beta^{-i}$ for all $1 \le i \le n$. Let $A = \sum_{i=1}^{n} a_i\alpha'_i$ and $B = \sum_{i=1}^{n} b_i\alpha'_i$
be elements in $GF(2^n)$. Then we have $AB = \sum_{j=1}^{n} (AB)_j\alpha'_j$ where the $j$th
coefficient $(AB)_j$ satisfies

\[ (AB)_j = \sum_{i=1}^{n} b_i(a_{j-i} + a_{j+i}), \]

where it is defined that $a_0 = 0$, $a_s = a_{-s}$ if $s$ is negative, and $a_s = a_{2n+1-s}$ if
$s > n$.

For our purpose, we only need to know the formula of $A\alpha$ with respect to the
basis $\{\alpha'_1, \alpha'_2, \ldots, \alpha'_n\}$. Letting $B = \alpha = \alpha'_1$ in the above lemma, we get $b_1 = 1$ and
$b_i$ is zero if $i \ne 1$. Thus $(A\alpha)_j = a_{j-1} + a_{j+1}$ for all $1 \le j \le n$. That is,

\[
A\alpha = a_2\alpha'_1 + (a_1 + a_3)\alpha'_2 + (a_2 + a_4)\alpha'_3 + \cdots
+ (a_{n-2} + a_n)\alpha'_{n-1} + (a_{n-1} + a_n)\alpha'_n. \tag{9}
\]

Using this formula and since $\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}$ and $\{\alpha'_1, \alpha'_2, \ldots, \alpha'_n\}$ are the same
sets where $\alpha_i = \alpha^{2^i}$ and $\alpha'_i = \beta^i + \beta^{-i}$, we find that the circuit for
exponentiation needs exactly $n$ flip-flops, $n$ 2-1 MUXs, and $n - 1$ XOR gates.

Proposition 2. Let $\alpha$ be a type II optimal normal element in $GF(2^n)$. Then
we can construct a linear array which computes $\alpha^r$ for any $r = \sum_{i=0}^{n-1} r_i 2^i$ with
$r_i = 0, 1$ using $n$ flip-flops, $n$ 2-1 MUXs and $n - 1$ XOR gates. The critical path
delay of our architecture is $D_X + D_M$ and the latency is $n$.

Example 1. Let $p = 11$ and $n = 5$ where the existence of a type II optimal normal
element $\alpha = \beta + \beta^{-1}$ is well known. Notice that $\alpha$ is a primitive element. Also
note the following correspondence,

\[
\alpha_0 = \alpha'_1, \quad \alpha_1 = \alpha'_2, \quad \alpha_2 = \alpha'_4, \quad
\alpha_3 = \beta^8 + \beta^{-8} = \alpha'_3, \quad \alpha_4 = \beta^{16} + \beta^{-16} = \alpha'_5, \tag{10}
\]

where $\beta^{11} = 1$ is used. Now let $A = a_0\alpha_0 + a_1\alpha_1 + a_2\alpha_2 + a_3\alpha_3 + a_4\alpha_4$. Then
$A^2 = a_4\alpha_0 + a_0\alpha_1 + a_1\alpha_2 + a_2\alpha_3 + a_3\alpha_4$. From the correspondence (10), we get
$A^2 = a_4\alpha'_1 + a_0\alpha'_2 + a_2\alpha'_3 + a_1\alpha'_4 + a_3\alpha'_5$. Thus from the formula (9),

\[
\begin{aligned}
A^2\alpha &= A^2\alpha'_1 \\
&= a_0\alpha'_1 + (a_2 + a_4)\alpha'_2 + (a_0 + a_1)\alpha'_3 + (a_2 + a_3)\alpha'_4 + (a_1 + a_3)\alpha'_5 \\
&= a_0\alpha_0 + (a_2 + a_4)\alpha_1 + (a_2 + a_3)\alpha_2 + (a_0 + a_1)\alpha_3 + (a_1 + a_3)\alpha_4.
\end{aligned} \tag{11}
\]

The basis of our circuit is $\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}$, and at each $i$th clock cycle, the
serial input $r_{n-i}$ selects via MUX one of the two values, $A^2$ or $A^2\alpha$. This is
realized in the circuit shown in Fig. 2.

Fig. 2. A circuit for exponentiation using a type II optimal normal element in
$GF(2^n)$ for $n = 5$
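The behaviour of the circuit in Example 1 can be simulated in software. The sketch below is only an illustration under stated assumptions: the helper names (`square`, `mul_alpha`, `power_of_alpha`) are hypothetical, coefficient vectors are lists `[a_0, ..., a_{n-1}]` with respect to the normal basis, and the index map from $\alpha_i$ to $\alpha'_m$ is taken as $m = \min(2^i \bmod p,\ p - 2^i \bmod p)$, which is assumed to be a permutation of $1, \ldots, n$ (true for $n = 5$, $p = 11$).

```python
def square(a):
    """Squaring in a normal basis is a cyclic shift: alpha_i^2 = alpha_{i+1}."""
    return [a[-1]] + a[:-1]

def mul_alpha(a, n):
    """Multiply A = sum a[i]*alpha_i by alpha using formula (9)."""
    p = 2 * n + 1
    # alpha_i = alpha'_{m[i]} (index map assumed to be a permutation of 1..n)
    m = [min(pow(2, i, p), p - pow(2, i, p)) for i in range(n)]
    c = [0] * (n + 2)            # c[s] = coefficient of alpha'_s, with c[0] = 0
    for i in range(n):
        c[m[i]] = a[i]
    c[n + 1] = c[n]              # convention a_s = a_{2n+1-s} for s > n
    d = [0] * (n + 1)
    for j in range(1, n + 1):
        d[j] = c[j - 1] ^ c[j + 1]
    return [d[m[i]] for i in range(n)]

def power_of_alpha(r, n):
    """Table 2 (binary method): A <- A^2 * alpha^{r_i}, most significant bit first."""
    A = [1] * n                  # the element 1 is the sum of all basis elements
    for i in reversed(range(n)):
        A = square(A)
        if (r >> i) & 1:
            A = mul_alpha(A, n)
    return A

# For n = 5 the 31 powers alpha^0, ..., alpha^30 are pairwise distinct,
# consistent with alpha being primitive in GF(2^5).
powers = {tuple(power_of_alpha(r, 5)) for r in range(31)}
```

One update step `mul_alpha(square(a), 5)` yields `[a0, a2^a4, a2^a3, a0^a1, a1^a3]`, i.e. the five coefficients of (11), which use $5 - 1 = 4$ XOR gates in total.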

3.2 Gauss Period of Type $(n, 3)$ over $GF(2)$

Let $3n + 1 = p$ be a prime and let $\beta$ be a primitive $p$th root of unity in $GF(2^{3n})$.
Let $\alpha = \beta + \beta^{\tau} + \beta^{\tau^2} \in GF(2^n)$ be a Gauss period of type $(n, 3)$ where $\tau$ is
a generator of the unique cyclic subgroup $K$ of order 3 in $GF(p)^\times$. Note that
there are unique $u$ and $v$ such that

\[ 1 + \tau^u 2^v = 0 \in GF(p). \tag{12} \]

Also notice $v \ne 0$ because $-1 \notin K = \langle\tau\rangle$. We claim that $v$ is the unique integer
satisfying

\[ 1 + \tau,\ 1 + \tau^2 \in K_v = 2^vK. \tag{13} \]

Since $\tau$ is an element of order 3 in $GF(p)^\times$, we get

\[ \tau^2 + \tau + 1 = 0. \tag{14} \]

Thus by (12) and (14),

\[ \tau + \tau^2 = -1 = \tau^u 2^v, \tag{15} \]

which implies

\[ 1 + \tau = \tau^{u-1} 2^v \quad \text{and} \quad 1 + \tau^2 = \tau^2(1 + \tau) = \tau^{u+1} 2^v. \tag{16} \]

Therefore the equation (13) is verified. Now from the equation (5),

\[ \alpha\alpha_v = \sum_{t \ne u} \alpha_{\sigma(t,v)} + 3 = \alpha_{\sigma(t_1,v)} + \alpha_{\sigma(t_2,v)} + 1, \tag{17} \]

where $t_1, t_2 \ne u$. We claim that $\sigma(t_1,v) \ne \sigma(t_2,v)$. In fact, more generally we
have



Lemma 3. If $i \ne 0, v$, then $\sigma(0,i)$, $\sigma(1,i)$, $\sigma(2,i)$ are all different. If $i = 0$, then
$\sigma(1,0) = \sigma(2,0) = v$ and $\sigma(0,0) = 1$. If $i = v$, then $\sigma(t_1,v) \ne \sigma(t_2,v)$.

Proof. The second statement is already proved in view of the equation (13). Now
suppose $i \ne 0, v$. To prove the first statement, we have to show that $1 + 2^i$, $1 +
\tau 2^i$, $1 + \tau^2 2^i$ are all in different cosets of $K$ in $GF(p)^\times$. Suppose on the contrary
that there exist $s \ne t$ such that $1 + \tau^s 2^i$ and $1 + \tau^t 2^i$ belong to the same coset.
Then we have

\[ \frac{1 + \tau^t 2^i}{1 + \tau^s 2^i} \in K = \{1, \tau, \tau^2\}. \tag{18} \]

Since $s \ne t$, a ratio equal to 1 is impossible. Suppose the ratio equals $\tau$. Then
$1 + \tau^t 2^i = \tau + \tau^{s+1} 2^i$. If $t \equiv s + 1 \pmod 3$, we get $\tau = 1$ which is absurd. Therefore
$t \equiv s + 2 \pmod 3$ and we get $1 + \tau^{s+2} 2^i = \tau + \tau^{s+1} 2^i$. Thus we get
$2^i = \tau^{-(s+1)} \in K$, which is a contradiction since $0 < i \le n-1$ and $n$
is the least positive integer satisfying $2^n \in K$. Now suppose the ratio equals $\tau^2$. Then
$1 + \tau^t 2^i = \tau^2 + \tau^{s+2} 2^i$ and the same technique can be applied. The proof of the last
statement is the same. □



From lemma 3, the multiplication structure of $\alpha\alpha_i$ is completely determined.
That is, when $i = 0$, $\alpha\alpha_0 = \alpha_1$ consists of one basis element. For $i = v$, we have
$\alpha\alpha_v = \alpha_{\sigma(t_1,v)} + \alpha_{\sigma(t_2,v)} + 1$ with $\sigma(t_1,v) \ne \sigma(t_2,v)$. And for $i \ne 0, v$, we get
$\alpha\alpha_i = \alpha_{\sigma(0,i)} + \alpha_{\sigma(1,i)} + \alpha_{\sigma(2,i)}$ where all the summands are different. Therefore,
except for the constant term 1 in the expression $\alpha\alpha_v$, the number of elements
which are in the summands of $\alpha\alpha_i$, $0 \le i \le n-1$, is exactly $3(n-1)$. On the
other hand, $\alpha_v$ appears only once as a summand of $\alpha\alpha_i$ for some $i$ since the two $\alpha_v$
in the expression of $\alpha\alpha_0$ cancel each other. Moreover $\alpha_0$ appears twice as
a summand of $\alpha\alpha_i$ for two different values of $i$. This is because we have only two
different pairs of $(t, i)$ satisfying $1 + \tau^t 2^i \in K$, i.e. $1 + \tau^t 2^i = 1$ is never satisfied.
Since the proof of lemma 1 says that $\alpha_j$ appears at most 3 times as a summand
of $\alpha\alpha_i$, $0 \le i \le n-1$, we conclude that $\alpha_j$ ($j \ne 0, v$) appears exactly 3 times as
a summand of $\alpha\alpha_i$. Letting $A = \sum_{i=0}^{n-1} a_i\alpha_i$, the multiplication structure of $A\alpha$
in the equation (7) says

\[ A\alpha = \sum_{j=0}^{n-1} \left( a_v + \sum_{i=0}^{n-1} a_i t_{ij} \right) \alpha_j. \tag{19} \]

From the observations on the number of basis elements appearing as summands of $\alpha\alpha_i$,
we conclude that $a_v + \sum_{i=0}^{n-1} a_i t_{ij}$ needs 2 additions if $j = 0$, one addition if
$j = v, \sigma(t_1,v), \sigma(t_2,v)$ and 3 additions if $j \ne 0, v, \sigma(t_1,v), \sigma(t_2,v)$.
Proposition 3. Let $\alpha$ be a Gauss period of type $(n, 3)$ in $GF(2^n)$. Then we can
construct a linear array which computes $\alpha^r$ for any $r = \sum_{i=0}^{n-1} r_i 2^i$ with $r_i = 0, 1$
using $n$ flip-flops, $n$ 2-1 MUXs and $3n - 7$ XOR gates. Each coefficient of $\alpha_j$
($j \ne 0, v, \sigma(t_1,v), \sigma(t_2,v)$) consists of 3 XOR gates. For $j = v, \sigma(t_1,v), \sigma(t_2,v)$,
each coefficient consists of one XOR gate, and the coefficient of $\alpha_0$ needs 2 XOR
gates. Thus the critical path delay of our architecture is $2D_X + D_M$ and the
latency is $n$.

Example 2. Let $p = 19$ and $n = 6$ where a Gauss period $\alpha$ of type (6, 3) in $GF(2^6)$
exists and is primitive. In this case, the unique cyclic subgroup of order 3 in
$GF(19)^\times$ is $K = \{1, 7, 11\}$. Let $\beta$ be a primitive 19th root of unity in $GF(2^{18})$.
Thus letting $\tau = 7$, $\alpha$ is written as $\alpha = \beta + \beta^7 + \beta^{11}$. The computation of
$\alpha\alpha_i$, $0 \le i \le 5$, is easily done from the following table. For each block regarding
$K$ and $K'$, the $(s, t)$ entry with $0 \le s \le 2$ and $0 \le t \le 5$ denotes $\tau^s 2^t$ and $1 + \tau^s 2^t$
respectively.



Table 3. Computation of $K_i$ and $K'_i$

K_0   K_1   K_2   K_3   K_4   K_5   |   K'_0   K'_1   K'_2   K'_3   K'_4   K'_5
1     2     4     8     16    13    |   2      3      5      9      17     14
7     14    9     18    17    15    |   8      15     10     0      18     16
11    3     6     12    5     10    |   12     4      7      13     6      11



From the above table, we easily deduce

\[ \alpha\alpha_0 = \alpha_1, \quad \alpha\alpha_1 = \alpha_1 + \alpha_2 + \alpha_5, \quad \alpha\alpha_2 = \alpha_0 + \alpha_4 + \alpha_5, \tag{20} \]

\[ \alpha\alpha_3 = \alpha_2 + \alpha_5 + 1, \quad \alpha\alpha_4 = \alpha_2 + \alpha_3 + \alpha_4, \quad \alpha\alpha_5 = \alpha_0 + \alpha_1 + \alpha_4. \tag{21} \]

For example, see the block $K'_2$ for the expression of $\alpha\alpha_2$. The entries of $K'_2$ are
5, 10, 7. Now see the blocks of the $K_i$ and find $5 \in K_4$, $10 \in K_5$, $7 \in K_0$. Thus we
get $\alpha\alpha_2 = \alpha_4 + \alpha_5 + \alpha_0$. Note that $v = 3$ and $\sigma(t_1,v), \sigma(t_2,v) = 2, 5$ in our
example. Let $A = \sum_{i=0}^{5} a_i\alpha_i$ be an element in $GF(2^6)$. Then
$A^2 = \sum_{i=0}^{5} a_i\alpha_{i+1}$, where $\alpha_6$ is understood as $\alpha_0$. Thus

\[
\begin{aligned}
A^2\alpha &= \sum_{i=0}^{5} a_i\alpha_{i+1}\alpha \\
&= a_0(\alpha_1 + \alpha_2 + \alpha_5) + a_1(\alpha_0 + \alpha_4 + \alpha_5) + a_2(\alpha_2 + \alpha_5 + 1) \\
&\quad + a_3(\alpha_2 + \alpha_3 + \alpha_4) + a_4(\alpha_0 + \alpha_1 + \alpha_4) + a_5\alpha_1 \\
&= a_2 + (a_1 + a_4)\alpha_0 + (a_0 + a_4 + a_5)\alpha_1 + (a_0 + a_2 + a_3)\alpha_2 \\
&\quad + a_3\alpha_3 + (a_1 + a_3 + a_4)\alpha_4 + (a_0 + a_1 + a_2)\alpha_5 \\
&= (a_1 + a_2 + a_4)\alpha_0 + (a_0 + a_2 + a_4 + a_5)\alpha_1 + (a_0 + a_3)\alpha_2 \\
&\quad + (a_2 + a_3)\alpha_3 + (a_1 + a_2 + a_3 + a_4)\alpha_4 + (a_0 + a_1)\alpha_5.
\end{aligned} \tag{22}
\]

Thus the exponentiation algorithm is realized in the following circuit.

Fig. 3. A circuit for exponentiation $\alpha^r$ using a Gauss period of type $(n, 3)$ in $GF(2^n)$
for $n = 6$
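The table lookups of Example 2 can be automated. The following sketch, with hypothetical helper names `product_rows` and `squared_times_alpha_terms`, assumes $q = 2$ and that the cosets $2^iK$ cover all nonzero residues modulo $p$; it rebuilds the products $\alpha\alpha_i$ from the cosets and then lists which $a_i$ feed each coefficient of $A^2\alpha$.

```python
def product_rows(n, k, tau):
    """For q = 2, return rows[i] = (bits, const) such that
    alpha * alpha_i = sum_j bits[j] * alpha_j + const."""
    p = n * k + 1
    K = [pow(tau, s, p) for s in range(k)]
    # coset_of[x] = i  means  x lies in K_i = 2^i K
    coset_of = {(x * pow(2, i, p)) % p: i for i in range(n) for x in K}
    rows = []
    for i in range(n):
        bits, const = [0] * n, 0
        for x in K:
            y = (1 + x * pow(2, i, p)) % p
            if y == 0:
                const ^= k & 1        # the pair (u, v) contributes the constant k
            else:
                bits[coset_of[y]] ^= 1
        rows.append((bits, const))
    return rows

def squared_times_alpha_terms(n, k, tau):
    """terms[j] = indices i such that a_i appears in the coefficient of alpha_j
    in A^2 * alpha, where A^2 = sum_i a_i * alpha_{i+1}."""
    rows = product_rows(n, k, tau)
    terms = []
    for j in range(n):
        # the constant 1 equals the sum of all alpha_j, so it is added to every bit
        terms.append([i for i in range(n)
                      if rows[(i + 1) % n][0][j] ^ rows[(i + 1) % n][1]])
    return terms

terms = squared_times_alpha_terms(6, 3, 7)
xor_gates = sum(len(t) - 1 for t in terms)
```

For $(n, k, \tau) = (6, 3, 7)$ the list `terms` is `[[1, 2, 4], [0, 2, 4, 5], [0, 3], [2, 3], [1, 2, 3, 4], [0, 1]]`, matching the last line of (22), and `xor_gates` is 11, which agrees with the count $3n - 7$ of Proposition 3.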



4 Primitive Elements Spanned by an Optimal Normal
Basis of Type I over $GF(2)$

Let $\alpha$ be a Gauss period of type $(n, 1)$ over $GF(2)$, where $n + 1 = p$ is a prime.
$\alpha$ is also called a type I optimal normal element and the corresponding normal
basis is called a type I optimal normal basis. Note that $\alpha$ is never primitive and
has a very low order, i.e. $\alpha^{n+1} = 1$ where $n + 1 \ll 2^n - 1$. Therefore one cannot
use the algorithms in Tables 1 and 2 for exponentiation for a practical purpose. On
the other hand, it should be noticed that primitive elements are not rare
in an arbitrary finite field $GF(2^n)$. That is, the number of primitive elements in
$GF(2^n)$ is $\phi(2^n - 1)$, where $\phi(x)$ is Euler's phi-function. Thus, the probability
for a randomly chosen element of $GF(2^n)^\times$ to be a primitive element is

\[ \frac{\phi(2^n - 1)}{2^n - 1} = \prod_{q \mid 2^n - 1} \left( 1 - \frac{1}{q} \right), \tag{23} \]

where the product runs through all primes $q$ dividing $2^n - 1$. As long as $2^n - 1$
is not a product of many small prime factors, which is a necessary condition to
avoid the Pohlig-Hellman attack for the discrete logarithm problem, the probability
is not so small. In fact, the following formula for the average value of the probability
is well known [25],

\[ \sum_{n=1}^{N} \frac{\phi(n)}{n} = \frac{6}{\pi^2} N + O(\log N). \tag{24} \]
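The probability in (23) is easy to evaluate for small $n$. The following sketch uses plain trial division, so it is only meant for modest field sizes; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction

def prime_factors(m):
    """Set of prime divisors of m, found by trial division."""
    factors = set()
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors

def primitive_probability(n):
    """phi(2^n - 1) / (2^n - 1) as an exact fraction, following (23)."""
    prob = Fraction(1)
    for q in prime_factors(2**n - 1):
        prob *= Fraction(q - 1, q)
    return prob
```

For example, $2^4 - 1 = 3 \cdot 5$ gives probability $8/15$, while $2^5 - 1 = 31$ is prime and gives $30/31$.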

Of course, our choice of $\alpha$ is not a randomly chosen element. Though $\alpha$ is not a
primitive element, we may ask a natural question whether there exists a primitive
element which is a sparse polynomial of $\alpha$, for example, a binomial of the form
$\alpha^s + \alpha^t$. However it turns out that they are never primitive if $n > 4$. To show
this, note that $\alpha^s + \alpha^t = \alpha^s(1 + \alpha^{t-s}) = \alpha^s(1 + \alpha^{2^j})$ for some $j$. Also from
the observation $(1 + \alpha)^{2^{n/2}} = 1 + \alpha^{2^{n/2}} = 1 + \alpha^{-1} = (1 + \alpha)/\alpha$, we get
$\alpha = (1 + \alpha)^{1 - 2^{n/2}}$, so the order of $1 + \alpha$ divides $(n + 1)(2^{n/2} - 1)$.
Therefore, neither $1 + \alpha$ nor $\alpha^s + \alpha^t$ is a primitive element if
$n + 1 < 2^{n/2} + 1$, i.e. if $n > 4$. The next possible choice is the elements of the
form $\alpha^s + \alpha^t + \alpha^u$, or more simply, $1 + \alpha^s + \alpha^t$ because the multiplication by
$\alpha$ contributes a negligible order $n + 1$. For this type of element, we could not
use the same technique as in proving that $1 + \alpha$ is of low order. In fact we found, by
a computation, that a trinomial $1 + \alpha + \alpha^s$ of a type I optimal normal element
$\alpha$ in $GF(2^n)$ is always a primitive element for some $s$ with only one exception
among all $n < 550$.

Table 4. List of $n < 550$ for which a type I optimal normal element $\alpha$ exists and its
corresponding primitive element

  n    primitive element                     n    primitive element          n    primitive element
  4    1 + α + α^?                         138    1 + α + α^?              372    1 + α + α^?
 10    1 + α + α^?                         148    1 + α + α^?              378    1 + α + α^?
 12    1 + α + α^?                         162    1 + α + α^?              388    1 + α + α^?
 18    1 + α + α^?                         172    1 + α + α^?              418    1 + α + α^?
 28    1 + α + α^? + α^?                   178    1 + α + α^?              420    1 + α + α^?
 36    1 + α + α^?                         180    1 + α + α^?              442    1 + α + α^?
 52    1 + α + α^?                         196    1 + α + α^?              460    1 + α + α^?
 58    1 + α + α^?                         210    1 + α + α^?              466    1 + α + α^?
 60    1 + α + α^?                         226    1 + α + α^?              490    1 + α + α^?
 66    1 + α + α^?                         268    1 + α + α^?              508    1 + α + α^?
 82    1 + α + α^?                         292    1 + α + α^?              522    1 + α + α^?
100    1 + α + α^?                         316    1 + α + α^?              540    1 + α + α^?
106    1 + α + α^?                         346    1 + α + α^?              546    1 + α + α^?
130    1 + α + α^?                         348    1 + α + α^?







We used MAPLE for the above computation. In the case of $n = 28$, there was
no primitive element which is a trinomial of $\alpha$, so we chose the next simplest
expression. Let $\gamma = 1 + \alpha^s + \alpha^t$ be a fixed primitive element in $GF(2^n)$ where
$\alpha$ is a type I optimal normal element. Let $\{1, \alpha, \alpha^2, \ldots, \alpha^n\}$ be an extended
AOP (all one polynomial) basis. Then we have the following algorithm which
computes $\gamma^r$ using the basis $\{1, \alpha, \alpha^2, \ldots, \alpha^n\}$.

Table 5. Exponentiation using $\gamma = 1 + \alpha^s + \alpha^t$ under the extended AOP basis

Input: $r = \sum_{j=0}^{n-1} r_j 2^j$
Output: $\gamma^r$
$A \leftarrow 1$
for ($i = n - 1$ to $0$; $i{-}{-}$)
    $A \leftarrow A^2\gamma^{r_i}$
end for



S. Kwon, C.H. Kim, and C.P. Hong

Note that the above algorithm is applicable for both software and hardware purposes.
Though the case $q = 2$ is dealt with in the above algorithm, one may also use other small
primes $q$ for efficient exponentiation. The operation $A \leftarrow A^2$ is free
in our basis because $\{1, \alpha, \alpha^2, \ldots, \alpha^n\} = \{1, \alpha^2, \alpha^4, \ldots, \alpha^{2n}\}$. Now letting
$A = \sum_{i=0}^{n} a_i\alpha^i$,

\[
\begin{aligned}
A\gamma &= A(1 + \alpha^s + \alpha^t) = A + A\alpha^s + A\alpha^t \\
&= \sum_{i=0}^{n} a_i\alpha^i + \sum_{i=0}^{n} a_{i-s}\alpha^i + \sum_{i=0}^{n} a_{i-t}\alpha^i
= \sum_{i=0}^{n} (a_i + a_{i-s} + a_{i-t})\alpha^i,
\end{aligned} \tag{25}
\]

where the coefficients $a_i, a_j$ are understood as $a_i = a_j$ if $i \equiv j \pmod{n+1}$
since $\alpha^{n+1} = 1$. Therefore the computation $A \leftarrow A^2\gamma$ needs 2 additions for each
coefficient of $\alpha^i$ ($0 \le i \le n$) and the total number of bit additions needed to
compute $\gamma^r$ is $2n(n+1)$ which is $O(n^2)$. By following the same ideas as in the previous
section, we find that

Proposition 4. Let $\alpha$ be a type I optimal normal element in $GF(2^n)$ and
assume that $\gamma = 1 + \alpha^s + \alpha^t$ is a primitive element for some $s$ and $t$. Then we can
construct a linear array which computes $\gamma^r$ for any $r = \sum_{i=0}^{n-1} r_i 2^i$ with $r_i = 0, 1$
using $n + 1$ flip-flops, $n + 1$ 2-1 MUXs and $2n + 2$ XOR gates. The critical path
delay of our architecture is $2D_X + D_M$ and the latency is $n$.

Example 3. Let $n = 4$ and let $\alpha$ be a type I optimal normal element, i.e. $\alpha$
is a 5th root of unity over $GF(2)$. It is trivial to show that $\gamma = 1 + \alpha + \alpha^3$ is
a primitive element in $GF(2^4)$. Let $A = a_0 + a_1\alpha + a_2\alpha^2 + a_3\alpha^3 + a_4\alpha^4$ be an
element in $GF(2^4)$ with respect to the extended AOP basis. From

\[
\begin{aligned}
A^2 &= a_0 + a_3\alpha + a_1\alpha^2 + a_4\alpha^3 + a_2\alpha^4, \\
A^2\alpha &= a_2 + a_0\alpha + a_3\alpha^2 + a_1\alpha^3 + a_4\alpha^4, \\
A^2\alpha^3 &= a_1 + a_4\alpha + a_2\alpha^2 + a_0\alpha^3 + a_3\alpha^4,
\end{aligned} \tag{26}
\]

we find

\[
\begin{aligned}
A^2\gamma &= A^2(1 + \alpha + \alpha^3) \\
&= (a_0 + a_1 + a_2) + (a_0 + a_3 + a_4)\alpha \\
&\quad + (a_1 + a_2 + a_3)\alpha^2 + (a_0 + a_1 + a_4)\alpha^3 + (a_2 + a_3 + a_4)\alpha^4.
\end{aligned} \tag{27}
\]

From this information, the computation of $\gamma^r$, $r = \sum_{i=0}^{3} r_i 2^i$, is easily realized in
the following circuit.

Fig. 4. A circuit for exponentiation $\gamma^r$ in $GF(2^n)$ for $n = 4$, where $\gamma$ is a trinomial of
a type I optimal normal element $\alpha$
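The update rule of Table 5 can be tried out in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' design: elements are lists of $n + 1$ bits over the extended AOP basis $\{1, \alpha, \ldots, \alpha^n\}$, the helper names are hypothetical, and since the representation is redundant ($1 + \alpha + \cdots + \alpha^n = 0$), results are normalized so that the coefficient of $\alpha^n$ is 0 before comparison.

```python
def square(a):
    """A -> A^2 permutes coefficients: a_i moves to position 2i mod (n+1)."""
    m = len(a)                       # m = n + 1 and alpha^m = 1
    out = [0] * m
    for i, bit in enumerate(a):
        out[(2 * i) % m] ^= bit
    return out

def mul_gamma(a, s, t):
    """Multiply by gamma = 1 + alpha^s + alpha^t as in (25)."""
    m = len(a)
    return [a[i] ^ a[(i - s) % m] ^ a[(i - t) % m] for i in range(m)]

def canonical(a):
    """Add 1 + alpha + ... + alpha^n (which is 0) if needed, so that a[n] == 0."""
    return [b ^ 1 for b in a] if a[-1] else list(a)

def power_of_gamma(r, n, s, t):
    """Table 5: A <- A^2 * gamma^{r_i}, most significant bit first."""
    a = [1] + [0] * n                # the element 1
    for i in reversed(range(n)):
        a = square(a)
        if (r >> i) & 1:
            a = mul_gamma(a, s, t)
    return canonical(a)

# n = 4, gamma = 1 + alpha + alpha^3: the powers gamma^0 .. gamma^14 are distinct.
powers = {tuple(power_of_gamma(r, 4, 1, 3)) for r in range(15)}
```

One step `mul_gamma(square(a), 1, 3)` reproduces the coefficients of (27), and the 15 distinct powers together with $\gamma^{15} = 1$ are consistent with $\gamma$ being primitive in $GF(2^4)$.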
Table 6. Comparison with previously proposed exponentiation architectures

                       [10]                  [12]                    [13]                  Fig. 1
latency                2n^2 + 2n             n(n-1) + ⌊?⌋ + 1        n(n-1)                n
critical path delay    ? D_AND + ? D_XOR     D_AND + ? D_XOR         ? D_AND + 2 D_XOR     ⌈log_2 k⌉ D_XOR + D_MUX
complexity
  AND                  4n^2(n-1)             3n^2(n-1)               3n(n-1)               0
  XOR                  4n^2(n-1)             3n^2(n-1)               3n(n-1)               kn
  MUX                  0                     0                       0                     n
  flip-flop            14n^2(n-1)            ? n^2(n-1)              4n(n-1)               n


5 Conclusions 

We proposed a compact and fast exponentiation architecture using a Gauss
period of type $(n, k)$ in $GF(2^n)$ for all $k$. Using the computational evidence that a
Gauss period of type $(n, k)$, $k \ge 2$, is very often primitive and by modifying the
multiplication algorithm given by Gao et al., we successfully constructed low
complexity arithmetic circuits which have possible applications such as smart
cards. Also for the case of a type I optimal normal element, i.e. a Gauss period
of type $(n, 1)$, we found primitive elements which yield a low complexity
multiplication structure and we gave an exponentiation algorithm which is
applicable for both software and hardware purposes. We presented explicit designs of
the circuits for Gauss periods of type $(n,k)$ when $k = 1, 2, 3$. Table 6 implies
that our linear array has many superior properties in terms of latency and gate
complexity compared with other existing exponentiation architectures, though
our circuit works only for a fixed primitive element. The critical path delay
$\lceil \log_2 k \rceil D_{XOR} + D_{MUX}$ in our architecture is not so long since in most of the
cases of $n < 1200$, we could choose a Gauss period of type $(n, k)$ where $k < 32$.
That is, $\lceil \log_2 k \rceil \le 5$.

Abstract. This paper describes new algorithms for computing a
modular inverse $e^{-1} \bmod f$ given coprime integers $e$ and $f$. Contrary to
previously reported methods, we neither rely on the extended Euclidean
algorithm, nor impose conditions on $e$ or $f$. The main application of
our gcd-free technique is the computation of an RSA private key in
both standard and CRT modes based on simple modular arithmetic
operations, thus boosting real-life implementations on crypto-accelerated
devices.

Keywords: Modular inverses, RSA key generation, prime numbers, ef- 
ficient implementations, embedded software, GCD algorithms. 



1 Introduction 

The usual way one computes a modular inverse is by applying the extended
Euclidean algorithm [8, Algorithm X, p. 325]. Given $e$ and $f$ on input, this
algorithm returns integers $\alpha$ and $\beta$ such that $\alpha e + \beta f = \gcd(e, f)$. Assuming
that $e$ is relatively prime to $f$, and therefore that $e$ has an inverse modulo $f$, it
follows that $\alpha e \equiv 1 \pmod f$, meaning that $\alpha \equiv e^{-1} \pmod f$.

Unfortunately, for code-optimization reasons, this algorithm is hardly available
on embedded platforms. Instead, algebraic tricks based on simple modular
arithmetic are highly preferred because gcd-type calculations may be too
intricate to handle on cryptoprocessors compared to modular operations. As an
example, executing an extended binary gcd may require far fewer arithmetic
or logic operations on large numbers than glue instructions such as register
switches, loop control, pointer management, etc., which makes this approach
comparatively prohibitive next to straightforward, arithmetic-only implementations. But
then, one requires that one of the two input values, $e$ or $f$, is prime. Indeed, when
$f$ is prime, the inverse of $e$ modulo $f$ is given by Fermat's Little Theorem, stating
that $e^{-1} \equiv e^{f-2} \pmod f$; when $e$ is prime, $e^{-1} \bmod f$ is given by Arazi's
well-known inversion formula.¹ Little is known about the other cases.

This paper presents simple ways for computing $e^{-1} \bmod f$ without the
(extended) Euclidean algorithm (or variants thereof) and without any restrictions
on $e$ or $f$. Our techniques only invoke usual basic operations like (possibly
modular) additions, multiplications and exponentiations. Since these operations are
optimized on devices supporting public-key cryptography, the technique we
propose is especially well-suited for smart-card on-board computation of an RSA
private key, in both standard and CRT modes.



2 Arazi’s Inversion Formula 



When $f$ is prime, the inverse of $e$ modulo $f$ is given by Fermat's Little Theorem
because $d = e^{-1} \bmod f = e^{f-2} \bmod f$. When $f$ is not prime and provided that
$e$ is prime, the usual trick consists in applying Arazi's inversion formula, which
expresses $e^{-1} \bmod f$ in terms of $f^{-1} \bmod e$.



Lemma 1 (Arazi). Let $e$ and $f$ be two positive integers. If $\gcd(e, f) = 1$ then

$$d = e^{-1} \bmod f = \frac{1 + f\,(-f^{-1} \bmod e)}{e} . \qquad (1)$$



Proof. Define $U = e(e^{-1} \bmod f) + f(f^{-1} \bmod e)$. Since $U \equiv 1 \pmod e$ and
$U \equiv 1 \pmod f$, it follows that $U \equiv 1 \pmod{ef}$. Hence, noting that $1 < e + f \le
U < 2ef$, this implies that $U = 1 + ef$, or equivalently that $e^{-1} \bmod f =
[1 + f\{e - (f^{-1} \bmod e)\}]/e$ as desired. □
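
A small numerical illustration of Lemma 1 may help. The sketch below assumes that $e$ is prime, so that $f^{-1} \bmod e$ can be obtained from Fermat's Little Theorem as discussed next; the function name is hypothetical.

```python
def arazi_inverse(e, f):
    """Return e^{-1} mod f for a prime e not dividing f, following Lemma 1."""
    f_inv = pow(f, e - 2, e)            # f^{-1} mod e by Fermat's Little Theorem
    numerator = 1 + f * ((-f_inv) % e)  # 1 + f * (-f^{-1} mod e)
    assert numerator % e == 0           # exact division is guaranteed by Lemma 1
    return numerator // e


d = arazi_inverse(17, 3120)
print(d, (17 * d) % 3120)  # prints "2753 1"
```

Only a modular exponentiation, a multiplication and an exact division are involved; no gcd-type loop is needed.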



Hence if $e$ is prime, its inverse $d$ modulo $f$ can easily be computed as

$$d = \frac{1 + f\,(-f^{e-2} \bmod e)}{e} .$$

This formula is limited to prime values for $e$, but is easily extended to

$$d = \frac{1 + f\,(-f^{\lambda(e)-1} \bmod e)}{e} \qquad (2)$$

whenever $\lambda(e)$ is known. We recall that computing Carmichael's function $\lambda(e)$
from $e$ requires factoring $e$, a task which imposes a very strong computational
requirement. So, the extended technique given by Eq. (2) is of no interest if
the inversion algorithm is not given $\lambda(e)$ as an input. The same remarks are
independently stated in [6].



¹ Named after Arazi, who was the first to take advantage of this folklore theorem to
implement fast modular inversions of RSA exponents on a crypto-processor.

2.1 Implementing Arazi's Formula with Modular Operations

Equation (2) requires an integer division (i.e., over $\mathbb{Z}$), which is performed either
directly by the cryptoprocessor (some of them may provide this functionality)
or in the following way. We compute $d = d \bmod 2^{|f|}$ where $|f|$ stands for the
binary length of $f$. The multiplication and incrementation in Eq. (2) are done
modulo $2^{|f|}$, and then the division is replaced by a multiplication by $e^{-1} \bmod
2^{|f|}$. This value may be hard-coded in the program when this is possible, or
dynamically computed as depicted in the algorithm of Fig. 1. Another algorithm
for performing an integer division can be found in [5, p. 235].



Input: $e$ (odd), $|f|$

Output: $e^{-1} \bmod 2^{|f|}$

$T \leftarrow \lceil \log_2(|f|) \rceil$; $y \leftarrow 1$

for $i = 1$ to $T$ do

    $y \leftarrow y(2 - ey) \bmod 2^{2^i}$
endfor

return $y \bmod 2^{|f|}$



Fig. 1. Inversion $e \mapsto e^{-1} \bmod 2^{|f|}$



Note that all operations involved here are modular. Besides, this technique
turns out to be extremely fast (only 10 iterations for a typical size $|f| = 1024$),
especially for small values of $e$.
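
The iteration of Fig. 1 and its use for an exact division can be sketched in Python as follows. This is a minimal illustration assuming that the modulus doubles its bit length at each round (precision $2^{2^i}$ at round $i$); the helper names are hypothetical.

```python
def inverse_mod_power_of_two(e, bits):
    """Return e^{-1} mod 2^bits for odd e, following the iteration of Fig. 1."""
    if e % 2 == 0:
        raise ValueError("e must be odd")
    rounds = (bits - 1).bit_length()   # equals ceil(log2(bits)) for bits >= 1
    y = 1                              # e * y = 1 (mod 2) because e is odd
    for i in range(1, rounds + 1):
        # The number of correct low-order bits doubles at every round.
        y = (y * (2 - e * y)) % (1 << (1 << i))
    return y % (1 << bits)


def exact_divide(numerator, e, bits):
    """Compute numerator / e when the quotient is an integer below 2^bits."""
    mask = (1 << bits) - 1
    return ((numerator & mask) * inverse_mod_power_of_two(e, bits)) & mask


# 46801 = 1 + 3120 * 15 is the numerator of Eq. (1) for e = 17, f = 3120 (|f| = 12).
print(exact_divide(46801, 17, 12))  # prints 2753
```

The division by $e$ is thus replaced by one multiplication and a truncation to $|f|$ bits, which is valid because the true quotient $d$ is smaller than $f < 2^{|f|}$.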



2.2 The Case of Composite Numbers 

In the sequel, $\Pi$ will always denote the product of small primes $\Pi = \prod_{i \in I} p_i$
for $I \subset \mathbb{N}$ and where $p_i$ is the $i$-th prime (i.e., $p_1 = 2$, $p_2 = 3$, ...). Unless stated
otherwise, we assume that $I = [1, k]$ for a certain bound $k$ depending on the
context of use for $\Pi$. We also assume that the choice for $\Pi$ has been made once
and for all, and that $\Pi$ and $\lambda(\Pi)$ are absolute constants coded in our algorithms.

Now suppose that $f$ is some composite number with unknown factorization.
We consider different scenarios depending on the information we have about the
operand $e$:

1. $e$ is known at compile time. We thus have access to $\lambda(e)$ which may be
written or coded in the program itself;

2. $e$ is an input data and is provided along with $\lambda(e)$;

3. $e$ is provided alone, but is known to be prime (and thus $\lambda(e) = e - 1$);

4. $e$ is given (e.g., dynamically loaded and provided) but nothing else is known
about $e$.

The first three situations lead to the implementation given below. The fourth
context of use is somewhat more intricate and requires a specific treatment as
shown later in Section 3.

We adopt the following twofold approach:

- one attempts to retrieve $\lambda(e)$, or a multiple thereof, in order to invoke the
previous, very efficient technique;

- if unsuccessful, one computes $d$ without that knowledge but in a heavier,
somewhat pathological way.

In some cases, retrieving $\lambda(e)$ may be quite efficient. When $e$ is smooth enough
so that $e \mid \Pi$, then a multiple of $\lambda(e)$ is simply $\lambda(\Pi)$. This may also hold without
necessarily having $e \mid \Pi$. In many situations, the following will be sufficient. Set

$$\Lambda = e(e - 1)\lambda(\Pi)\Pi , \qquad (3)$$

and execute the inversion algorithm of the previous section by replacing $\lambda(e) - 1$
by $\Lambda - 1$. Get the result $d$ and test whether $ed \equiv 1 \pmod f$. We output $d$ if this
congruence holds. Otherwise, we know that the structure of $e$ is less simple than
originally thought.²
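
The attempt based on Eq. (3) can be sketched as follows. This is an illustrative Python fragment, not the authors' code: the function name is hypothetical, and the defaults $\Pi = 30$, $\lambda(\Pi) = 4$ are an assumed choice of constants.

```python
def try_inverse_with_lambda_multiple(e, f, Pi=30, lam_Pi=4):
    """Try Eq. (2) with Lambda = e(e-1)*lambda(Pi)*Pi in place of lambda(e).

    Returns d = e^{-1} mod f when the final check succeeds, otherwise None.
    """
    Lam = e * (e - 1) * lam_Pi * Pi
    u = pow(f, Lam - 1, e)              # equals f^{-1} mod e if lambda(e) divides Lam
    d = (1 + f * ((-u) % e)) // e       # Eq. (2); meaningless if u is not the inverse
    return d if (e * d) % f == 1 else None


# e = 15 divides Pi = 30, so lambda(15) = 4 divides Lambda and the attempt succeeds.
print(try_inverse_with_lambda_multiple(15, 3127))  # prints 417
```

The final test $ed \equiv 1 \pmod f$ is what makes the shortcut safe: a wrong guess for the multiple of $\lambda(e)$ is detected and the caller can fall back to the general algorithms.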

The next section describes new efficient approaches that always return the
value of $d = e^{-1} \bmod f$ whatever the conditions on $e$ and/or $f$.

3 Extended Algorithms for Composite Integers 

3.1 Algorithm 1 

Our idea is fairly simple. It is based on the somewhat obvious observation that

$$e^{-1} \equiv (e + Cf)^{-1} \pmod f$$

for any integer $C$. Therefore we can add an appropriate multiple of $f$ to $e$ so
that the result is prime and then apply Arazi's inversion formula directly to it.
Define

$$\hat{e} = e + Cf .$$

We require $\hat{e}$ to be a prime. A naive way to find such an $\hat{e}$ consists in trying
$C = 1, 2, \ldots$ and so on until $e + Cf$ is prime [4]. We can however do much better.

Proposition 1. Let $e$ and $f$ be two positive integers with $\gcd(e, f) = 1$. Let also
$\Pi = \prod p_i$ be a product of (small) primes. Define

$$\hat{e} = e + Cf \quad\text{with}\quad C = [1 - e^{\lambda(\Pi)}] \bmod \Pi . \qquad (4)$$

Then we have $\gcd(\hat{e}, p_i) = 1$ for all primes $p_i$ dividing $\Pi$. Moreover, we have
$\gcd(\hat{e}, f) = 1$.

² In this event, $e$ is a composite number with at least one prime factor $e_i$ with $e_i \nmid \Pi$
such that $e_i - 1$ has some large prime factor ...

Proof. (i) Consider first the case $\gcd(f, p_i) = p_i$. Then from Eq. (4) we have
$\hat{e} \equiv e \pmod{p_i}$. Moreover, since by definition $\gcd(e, f) = 1$, it follows that
$\gcd(\hat{e}, p_i) = \gcd(e, p_i) = 1$.

(ii) Suppose now that $\gcd(f, p_i) = 1$. If $\gcd(e, p_i) = p_i$ then Eq. (4) yields
$C \equiv 1 \pmod{p_i}$, which implies $\hat{e} \equiv f \pmod{p_i}$ and consequently $\gcd(\hat{e}, p_i) =
\gcd(f, p_i) = 1$. Conversely, assuming $\gcd(e, p_i) = 1$ forces $C \equiv 0 \pmod{p_i}$ and
then $\hat{e} \equiv e \pmod{p_i}$ which, again, leads to $\gcd(\hat{e}, p_i) = \gcd(e, p_i) = 1$.

Finally, as $\hat{e} \equiv e \pmod f$, it follows that $\gcd(\hat{e}, f) = \gcd(e, f) = 1$. □

As $\hat{e}$ is co-prime to all small primes $p_i \mid \Pi$, it is likely to be a prime number,
which we can test using some primality test.³ If the test is unsuccessful, we
re-iterate the process with another candidate $\hat{e} \leftarrow \hat{e} + f\Pi$, and so forth until
$\hat{e}$ is a prime number. Remark that the updated $\hat{e}$ is congruent to the previous one
modulo each $p_i$ and modulo $f$, and so also verifies Proposition 1.

We note that our technique differs from the one described in [7] in several
ways, and in particular, the building of $\hat{e}$ from $e$ is not probabilistic. The resulting
algorithm is detailed in Fig. 2.



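Input: $e$, $f$ with $\gcd(e, f) = 1$

       $\Pi = \prod p_i$, $\lambda(\Pi)$

Output: $d = e^{-1} \bmod f$

$C \leftarrow [1 - e^{\lambda(\Pi)}] \pmod{\Pi}$; $\hat{e} \leftarrow e + Cf$

while ($\hat{e}$ is not prime) do
    $\hat{e} \leftarrow \hat{e} + f\Pi$
endwhile

$u \leftarrow f^{\hat{e}-2} \bmod \hat{e}$; $d \leftarrow [1 + f(\hat{e} - u)]/\hat{e}$

return $d$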



Fig. 2. A first inverting algorithm (Algorithm 1) 
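
A possible Python rendering of Algorithm 1 is given below as a sketch. The primality test (Miller-Rabin with fixed bases) and the defaults $\Pi = 2 \cdot 3 \cdot 5$, $\lambda(\Pi) = 4$ are assumptions made for the illustration; the paper leaves the choice of test and of $\Pi$ open.

```python
def is_probable_prime(n):
    """Miller-Rabin with fixed bases (deterministic only for moderately sized n)."""
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    if n < 2:
        return False
    for p in bases:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True


def inverse_alg1(e, f, Pi=30, lam_Pi=4):
    """Return e^{-1} mod f for coprime e, f along the lines of Fig. 2."""
    C = (1 - pow(e, lam_Pi, Pi)) % Pi
    e_hat = e + C * f                      # coprime to every prime dividing Pi
    while not is_probable_prime(e_hat):
        e_hat += f * Pi                    # stays congruent to e modulo f
    u = pow(f, e_hat - 2, e_hat)           # f^{-1} mod e_hat (Fermat)
    return (1 + f * (e_hat - u)) // e_hat  # Arazi's formula applied to e_hat


print(inverse_alg1(17, 3120), inverse_alg1(15, 3127))  # prints "2753 417"
```

Since $\hat{e} \equiv e \pmod f$, Arazi's formula applied to $\hat{e}$ directly returns $e^{-1} \bmod f$ in the range $[0, f)$, whether or not $e$ itself is prime.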



There may exist variations of Algorithm 1. To illustrate the diversity of our
technique, we provide here another alternative. We state:

Proposition 2. Let $e$ and $f$ be two positive integers with $\gcd(e, f) = 1$. Let also
$\Pi = \prod p_i$ be a product of (small) primes, and $c \in \mathbb{Z}_\Pi^*$. Define

$$\hat{e} = e + Cf \quad\text{with}\quad C = [(c - e) f^{\lambda(\Pi)-1}] \bmod \Pi . \qquad (5)$$

Then $\gcd(\hat{e}, p_i) = 1$ for all $p_i$ dividing $\Pi$.

³ Popular primality tests like Fermat's test or Miller-Rabin's test are easy to implement
with basic modular operations and hence are quite fast in practice.

Proof. (i) Consider first the case $\gcd(f, p_i) = p_i$. Then from Eq. (5) we have
$\hat{e} \equiv e \pmod{p_i}$. Moreover, since by definition $\gcd(e, f) = 1$, it follows that
$\gcd(\hat{e}, p_i) = 1$.

(ii) Suppose now that $\gcd(f, p_i) = 1$. Then Eq. (5) yields $\hat{e} \equiv c \pmod{p_i}$ and
thus we find again $\gcd(\hat{e}, p_i) = 1$ since $c \in \mathbb{Z}_\Pi^*$. □

As $\hat{e}$ is co-prime to all small primes $p_i \mid \Pi$, it is likely a prime number.
Otherwise we re-iterate the process with another candidate $c \in \mathbb{Z}_\Pi^*$.



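Input: $f$, $e$, $\Pi$, and $a \in \mathbb{Z}_\Pi^* \setminus \{1\}$

Output: $d_f$



1. Compute $U \leftarrow f^{\lambda(\Pi)-1} \bmod \Pi$

2. Set $C \leftarrow [(c - e)U \bmod \Pi]$ and $\hat{e} \leftarrow e + Cf$

3. If ($T(\hat{e})$ = false) then

   a) Set $c \leftarrow a\,c \pmod{\Pi}$

   b) Go to Step 2

4. Compute $F \leftarrow -f^{\hat{e}-2} \bmod \hat{e}$

5. Output $d_f = (Ff + 1)/\hat{e}$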



Fig. 3. An alternative algorithm (Algorithm 1’) 



Again, note that all operations in the above algorithm exclusively rely on
basic modular arithmetic. If the cryptoprocessor cannot handle integer divisions
directly, the division by $\hat{e}$ in the last step can be computed with the algorithm
described in Fig. 1.



3.2 Algorithm 2 

A second algorithm can be derived by exchanging the roles of $e$ and $f$ in
Proposition 1. Doing so, we obtain a prime $\hat{f}$. Two applications of Arazi's formula
then give the expected result.

Since $\hat{f}$ is prime, the inverse of $e$ modulo $\hat{f}$ is given by $u = e^{\hat{f}-2} \bmod \hat{f}$.
Noting that $\hat{f} \equiv f \pmod e$, a first application of Arazi's formula makes it possible to
recover the value of $f^{-1} \bmod e$ as

$$v := f^{-1} \bmod e = \frac{1 + e(\hat{f} - u)}{\hat{f}} \qquad (6)$$

and a second application yields

$$d = e^{-1} \bmod f = \frac{1 + f(e - v)}{e} .$$

In many cases, the value of $e$ is small compared to that of $f$. Moreover, using
the fact that $\gcd(e, f) = \gcd(e \bmod f, f)$, we obtain the following corollary of
Proposition 1.

Corollary 1. With the notations of Proposition 1, define $\bar{e} = e \bmod f$ and

$$\hat{e} = \bar{e} + Cf \quad\text{with}\quad C = [1 - \bar{e}^{\lambda(\Pi)}] \bmod \Pi .$$

Then we have $\gcd(\hat{e}, p_i) = 1$ for all primes $p_i$ dividing $\Pi$. Moreover, we have
$\gcd(\hat{e}, f) = 1$.

Proof. Straightforward by replacing $e$ with $\bar{e}$ in Proposition 1. □

Therefore, we can advantageously consider $\bar{f}$ instead of $f$ (remember that for
our second algorithm the roles of $e$ and $f$ are exchanged in Proposition 1) and
so evaluate $v$ in Eq. (6) as

$$v = \bar{f}^{-1} \bmod e = \frac{1 + e(\hat{f} - u)}{\hat{f}}$$

where $\bar{f} = f \bmod e$ and the prime $\hat{f}$ is built from $\bar{f}$.

Putting all together, we obtain a second algorithm for computing modular
inverses.



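Input: $e$, $f$ with $\gcd(e, f) = 1$

       $\Pi = \prod p_i$, $\lambda(\Pi)$

Output: $d = e^{-1} \bmod f$

$\hat{f} \leftarrow f \bmod e$

if ($\hat{f}$ is not prime) then
    $C \leftarrow [1 - \hat{f}^{\lambda(\Pi)}] \pmod{\Pi}$; $\hat{f} \leftarrow \hat{f} + Ce$
    while ($\hat{f}$ is not prime) do
        $\hat{f} \leftarrow \hat{f} + e\Pi$                          [i]
    endwhile
endif

$u \leftarrow e^{\hat{f}-2} \pmod{\hat{f}}$

$d \leftarrow [\hat{f} + f(eu - 1)]/(e\hat{f})$                     [ii]

return $d$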





Fig. 4. Our second algorithm (Algorithm 2) 
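
Algorithm 2 can be sketched in Python in the same spirit. Because $\hat{f}$ is only slightly larger than $e$, this illustration uses plain trial division as the primality test, which is an assumption that is reasonable only for small $e$; the function names and the defaults $\Pi = 30$, $\lambda(\Pi) = 4$ are hypothetical choices.

```python
def is_prime(n):
    """Trial division; adequate here because f_hat has roughly the size of e."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def inverse_alg2(e, f, Pi=30, lam_Pi=4):
    """Return e^{-1} mod f for coprime e, f along the lines of Fig. 4."""
    f_hat = f % e
    if not is_prime(f_hat):
        C = (1 - pow(f_hat, lam_Pi, Pi)) % Pi
        f_hat += C * e                     # coprime to every prime dividing Pi
        while not is_prime(f_hat):
            f_hat += e * Pi                # line [i]
    u = pow(e, f_hat - 2, f_hat)           # e^{-1} mod f_hat (Fermat)
    # Line [ii]: both applications of Arazi's formula merged into one division.
    return (f_hat + f * (e * u - 1)) // (e * f_hat)


print(inverse_alg2(17, 3120), inverse_alg2(15, 3127))  # prints "2753 417"
```

For $e = 17$ and $f = 3120$ the search stops at $\hat{f} = 179$, so the only exponentiation is performed modulo a number of about the size of $e$.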



This second algorithm is particularly efficient when $e$ is small since then $\Pi$
may be chosen smaller, which in turn implies smaller values for $C$ and for $\hat{f}$. On
the contrary, the first algorithm (Fig. 2) is more suitable when the size of $e$ is
roughly the same as that of $f$.

As easily seen, the choice for parameter $\Pi$ remains completely free. We now
discuss the best way to choose $\Pi$ in practice. Primality checks executed in our
'while' loop involve integers of bitsize close to $|e\Pi|$. As most practical
implementations for primality testing are of cubic complexity, a single test has a cost
$\propto |e\Pi|^3$. Moreover, the average number of tests amounts to

$$|e\Pi| \cdot \ln 2 \cdot \frac{\phi(\mathrm{lcm}(e, \Pi))}{\mathrm{lcm}(e, \Pi)} .$$

Combining these two facts, and upper bounding the ratio $\phi(\mathrm{lcm}(e, \Pi))/\mathrm{lcm}(e, \Pi)$
by $\phi(\Pi)/\Pi$, the average workfactor for finding $\hat{f}$ is bounded by a function
proportional to $|e\Pi|^4\,\phi(\Pi)/\Pi$. Therefore, provided that $\Pi = \prod_{i=1}^{k} p_i$, an optimal
choice for $k$ with respect to a given $|e|$ is easily found. Interestingly, for small
parameter lengths such as $|e| = 32$ or 64, the optimum is obtained for $k = 3$
($\Pi = 2 \cdot 3 \cdot 5$), i.e., for an extremely small value of $\Pi$. Algorithm 2 then performs
only a few primality checks over integers of size close to that of $e$, and is
therefore very fast.
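
The estimate above is easy to evaluate numerically. The following sketch assumes that $|e\Pi|$ can be approximated by $|e| + \lceil \log_2 \Pi \rceil$ and uses the upper bound $\phi(\Pi)/\Pi$; the function names and the list of candidate primes are hypothetical.

```python
import math

SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]


def density_and_bits(e_bits, k):
    """Return (phi(Pi)/Pi, approximate bit size of e*Pi) for the first k primes."""
    primes = SMALL_PRIMES[:k]
    Pi = math.prod(primes)
    phi = math.prod(p - 1 for p in primes)
    return phi / Pi, e_bits + math.ceil(math.log2(Pi))


def expected_tests(e_bits, k):
    """Average number of primality tests, bounded using phi(Pi)/Pi."""
    ratio, bits = density_and_bits(e_bits, k)
    return bits * math.log(2) * ratio


def work_factor(e_bits, k):
    """Quantity proportional to |e*Pi|^4 * phi(Pi)/Pi."""
    ratio, bits = density_and_bits(e_bits, k)
    return bits ** 4 * ratio


def best_k(e_bits):
    return min(range(1, len(SMALL_PRIMES) + 1), key=lambda k: work_factor(e_bits, k))


print(round(expected_tests(32, 3), 2), round(expected_tests(64, 3), 2))  # 6.84 12.75
print(best_k(32), best_k(64))                                            # 3 3
```

Under these approximations the minimum is indeed reached for $k = 3$ for both sizes, although the neighbouring values of $k$ give work factors that are only slightly larger.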



Remark 1. Certain hardware implementations return the value of $f$ div $e$
together with the value of $f \bmod e$ when computing the remainder of an integer
division. In this case, the division by $e\hat{f}$ in the expression of $d$ (cf. [ii] in Fig. 4)
can be reduced to a division by $\hat{f}$. Initializing $C$ to 0 and keeping track of its
accumulated value, we can replace Line [i] by

$C \leftarrow C + \Pi$; $\hat{f} \leftarrow \hat{f} + e\Pi$                    [i']

and Line [ii] by

$d \leftarrow [fu - (f \text{ div } e) + C]/\hat{f}$ .                 [ii']



4 Application to RSA 

RSA [11], named after its inventors Rivest, Shamir, and Adleman, is undoubtedly 
the most widely used cryptosystem. We give hereafter a short description and 
refer the reader to the original paper or any textbook on cryptography for further 
details. 

Let $n = pq$ be the product of two large primes. We let $e$ and $d$ denote a
matching pair of public exponent/private exponent, according to

$$ed \equiv 1 \pmod{\lambda(n)} , \qquad (7)$$

where $\lambda$ is Carmichael's function. In particular, for RSA, we have $n = pq$ and
$\lambda(n) = \mathrm{lcm}(p - 1, q - 1)$.

Given $x \in\, ]0, n[$, the public operation (e.g., encryption of a message or
verification of a signature) consists in raising $x$ to the power $e$, modulo $n$, i.e.,
in computing $c = x^e \bmod n$. Next, from $c$, the corresponding private operation
(e.g., decryption of a ciphertext or a signature generation) is $c^d \bmod n$. From
the definition of $e$ and $d$, we obviously have that $c^d \equiv x \pmod n$.

4.1 Standard Mode

In standard mode, on input $p$, $q$ and $e$, one has to compute the private exponent
$d$ satisfying Eq. (7). We assume that we are given $\Pi$, the product of small primes,
along with $\lambda(\Pi)$. These numbers are pure constants, and are thus easily
hard-coded into the implementation.

When $e$ (or its factorization) cannot be determined in advance, a direct
application of Algorithm 1 (or Algorithm 2) with $e$ and $f = \mathrm{lcm}(p - 1, q - 1)$ on
input will output the corresponding secret key $d$.

If the value of $\mathrm{lcm}(p - 1, q - 1)$ cannot be computed, one can replace the
Carmichael function of $n$ by the Euler totient function and take $f = (p - 1)(q - 1)$.
This, however, results in a larger yet valid value for $d$.

From a computational viewpoint, taking $|f| = 1024$ and $|e| = 32$ for instance,
a typical implementation of Algorithm 2 would use the specific choice $\Pi = 2 \cdot 3 \cdot 5$.
Thus, around 6.83 primality tests over 37-bit numbers are required, on average.
When $|f| = 1024$ and $|e| = 64$, the same choice for $\Pi$ yields an average of 12.75
primality tests over 69-bit numbers. In addition to that, operations starting
and ending the algorithm are almost negligible: computing $C$ amounts to a few
squares modulo $2 \cdot 3 \cdot 5$; $u$ requires an exponentiation of size close to $|e|$; and
the computation of $d$ (thanks to our technique in Fig. 1) boils down to a few
multiplications carried out modulo $2^{|f|}$.



4.2 CRT Mode 

The private operation can be sped up through Chinese remaindering (CRT
mode) [9]. The computations are performed modulo $p$ and $q$ and then
recombined. The private parameters are $\{p, q, d_p, d_q, i_q\}$ with $d_p = d \bmod (p - 1)$,
$d_q = d \bmod (q - 1)$ and $i_q = q^{-1} \bmod p$. We then obtain $c^d \bmod n$ as

$$\mathrm{CRT}(x_p, x_q) = x_q + q\,[i_q(x_p - x_q) \bmod p] ,$$

where $x_p = c^{d_p} \bmod p$ and $x_q = c^{d_q} \bmod q$. The expected speed-up factor is 4,
compared to the standard (i.e., non-CRT) mode.
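
As an illustration of the recombination formula, here is a toy Python sketch with the small textbook parameters $p = 61$, $q = 53$, $e = 17$, $d = 2753$. These values and the function name are assumptions chosen for readability and are far too small for real use.

```python
def rsa_crt_private(c, p, q, dp, dq, iq):
    """Compute c^d mod pq from the CRT parameters (dp, dq, iq)."""
    xp = pow(c, dp, p)
    xq = pow(c, dq, q)
    return xq + q * ((iq * (xp - xq)) % p)


p, q, e, d = 61, 53, 17, 2753
dp, dq = d % (p - 1), d % (q - 1)      # 53 and 49
iq = pow(q, p - 2, p)                  # q^{-1} mod p, since p is prime

x = 1234
c = pow(x, e, p * q)                   # public operation
print(rsa_crt_private(c, p, q, dp, dq, iq))  # prints 1234
```

The two exponentiations work on half-size moduli with half-size exponents, which is where the expected factor of 4 comes from.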

In CRT mode, the procedure is essentially the same. We apply Algorithm 1 or
2 where inputs $e$ and $f$ are initialized to $e \bmod (p - 1)$ and $p - 1$, respectively.
This yields the value of the private exponent $d_p$. Similarly, the exponent $d_q$ is
obtained by initializing $e$ and $f$ to $e \bmod (q - 1)$ and $q - 1$, respectively.



4.3 Standard Mode (II) 



There is another way to compute the private key in standard mode. We first
compute $d_p$ and $d_q$ as described in the previous section. Next, letting $Q := q - 1$
and $\Lambda := \lambda(n)/(q - 1)$, we compute the inverse of $Q$ modulo $\Lambda$, say $i_Q$, thanks
to the algorithm of Fig. 2 as⁴

$$i_Q = \frac{1 + \Lambda\,[-\Lambda^{\hat{Q}-2} \bmod \hat{Q}]}{\hat{Q}} \qquad (8)$$

where

$$\hat{Q} = q - 1 + C_Q\Lambda \quad\text{with}\quad C_Q = [1 - (q - 1)^{\lambda(\Pi)}] \bmod \Pi + \mu\Pi$$

for some $\mu \ge 0$ such that $\hat{Q}$ is prime. Therefore, Chinese remaindering on $d_p$ and
$d_q$ finally gives $d = d_q + (q - 1)[i_Q(d_p - d_q) \bmod \Lambda]$.

⁴ Remark here that we have to compute the inverse of $(q - 1)$ modulo
$\mathrm{lcm}(p - 1, q - 1)/(q - 1)$, as $e^{-1} \bmod f$ exists if and only if $\gcd(e, f) = 1$.

5 Conclusion 

We devised new algorithms for computing modular inverses in a gcd-free manner.
We stress that, by implementing our techniques, an RSA key generation process can
be executed on any given crypto-enhanced embedded processor in almost all
circumstances.



Acknowledgements. We are grateful to Jean-François Dhem for pointing out
reference [4] and to Karine Villegas for her careful reading of an earlier version
of this paper.

Abstract. Efficient implementations of RSA on computationally limited
devices, such as smartcards, often use the CRT technique in combination
with Garner's algorithm in order to make the computation of
modular exponentiation as fast as possible. At PKC 2001, Novak proposed
to use some information that may be obtained by simple power
analysis of the execution of Garner's algorithm to recover the
factorization of the RSA modulus. The drawback of this approach is that it
requires chosen messages; in the context of RSA decryption this can be
realistic, but for RSA signatures, standardized padding schemes
make an adaptive choice of the message representative impossible.

In this paper, we use the same basic idea as Novak but we focus
on the use of known messages. Consequently, our attack applies to
the RSA signature scheme, whatever the padding may be. However, our
new technique, based on SPA and lattice reduction, requires a small
difference, say 10 bits, between the bit lengths of the modulus prime factors.

Keywords: Simple Power Analysis, RSA signature, factorization, LLL 
algorithm. 



1 Introduction 

Since the introduction in 1996 of timing attacks by Kocher [5], many papers
have considered various side channel attacks and the potential countermeasures.
Side channel attacks make it possible to extract some information on the manipulated data
which can be used to recover secret data. This general kind of attack can be
divided into several different techniques: timing attacks [5], or Simple Power
Analysis (SPA) and Differential Power Analysis (DPA), both introduced in 1999
by Kocher, Jaffe and Jun [6]. Many countermeasures have been proposed against
such attacks, but addressing all weaknesses when implementing an algorithm is
a hard task. Power attacks are very difficult to prevent, and thus, most of the
time, countermeasures do not suffice to thwart all of them.

In the public key setting, many papers have focused on the security of
cryptosystems against such side channel attacks [5,11,10]. In particular, the RSA
signature and encryption schemes have received most of the attention. To sign a message
$M$ with RSA, $M$ is first transformed using an appropriate padding scheme and hash
function into a representative $m \in \mathbb{Z}_N$, where $N$ is the product of two primes
$p$ and $q$. Then $m^d \bmod N$ is computed, where $d$ is the secret key of the signer.
This yields the signature for the message $M$.

Side channel attacks on RSA extract information from the exponentiation
step. For example, by precisely measuring the time it takes for the cryptographic
device to perform the RSA signature, an attacker can recover the secret key, as
shown by Kocher in [5]. This timing attack can be mounted against naive
implementations of RSA using the repeated square and multiply algorithm. The
attack recovers the secret exponent $d$, one bit at a time. Indeed, for each nonzero
bit of the secret exponent $d$, an additional multiplication is performed. Analyzing
differences between running times for various input values reveals the secret
key. Another classical way to attack a basic RSA implementation is to use the SPA
technique, which consists in measuring the power consumption during
exponentiation [3]. Since power consumption also reveals whether the additional
multiplication is done, all bits of the secret exponent can be recovered by
monitoring only one exponentiation.
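
The leak can be made concrete with a toy model. The sketch below is an idealized illustration in which the attacker is assumed to distinguish squarings ('S') from multiplications ('M') perfectly in the power trace; the function names are hypothetical.

```python
def square_and_multiply_trace(m, d, N):
    """Left-to-right square and multiply; also returns the operation sequence."""
    result, trace = 1, []
    for bit in bin(d)[2:]:
        result = result * result % N
        trace.append("S")
        if bit == "1":                 # extra multiplication only for 1-bits
            result = result * m % N
            trace.append("M")
    return result, "".join(trace)


def exponent_from_trace(trace):
    """Rebuild the exponent: 'SM' encodes a 1-bit, a lone 'S' encodes a 0-bit."""
    bits, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i + 1] == "M":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return int("".join(bits), 2)


signature, trace = square_and_multiply_trace(1234, 2753, 3233)
print(exponent_from_trace(trace))  # prints 2753
```

A single trace is enough in this model, which is why constant-sequence exponentiation algorithms are commonly used as a countermeasure.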

Optimized implementations are also subject to attacks. The Chinese Remainder
Theorem is a well-known technique to optimize RSA exponentiation. In a
CRT implementation, the signer first computes separately the signature modulo
each prime factor $p$ and $q$. He then uses the Chinese Remainder Theorem to
compute the signature $S \bmod N$. Since the size of $p$ and $q$ is about half the size
of $N$, CRT exponentiation is about four times faster than direct exponentiation.
The first attack on RSA-CRT was presented in 1997 by Boneh, DeMillo
and Lipton [1]. It is based on fault injection during computation. By using a
valid signature for a message and a faulty one, the modulus $N$ can be efficiently
factored. A timing attack against RSA with the Chinese Remainder Theorem
(CRT) is also possible [11], when the Montgomery algorithm is used for squaring
and multiplication operations. Recently, Novak [10] has described an adaptive
chosen message attack against smart card implementations of RSA decryption
when Garner's algorithm implements the CRT. This attack is based on
simple power analysis (SPA). The power consumption of the card leaks
information on the secret manipulated data. The cryptanalyst's goal is to relate such
information to the bits of the secret key. Although this attack can be mounted
against the RSA decryption scheme, it is not realistic in practice against the RSA
signature scheme. Indeed, a padding scheme is used in practical implementations
and then chosen-input attacks cannot be mounted.

In this paper, we show how to extend this attack to the case of RSA signature
based on any encoding scheme, such as PKCS#1 [7]. In particular we show that
with a simple power analysis, if the RSA modulus $N = pq$ is such that $q < p/2^{\ell}$,
the RSA factors p and q can be recovered by performing 60 x 2^ signatures 
on average. The value $\ell$ should be larger than an explicit bound that we specify in
this paper. These signatures can be computed on any messages, not necessarily
chosen by the adversary.

- Input: A message $M$ to sign, the private key $(p, q, d)$,
with $p > q$, the pre-calculated values $d_p = d \bmod (p - 1)$,
$d_q = d \bmod (q - 1)$, and $u = q^{-1} \bmod p$.

- Output: a valid signature $S$ for the message $M$

1. Encode the message $M$ in $m \in \mathbb{Z}_N$ (with PKCS#1)

2. Compute $S_p = m^{d_p} \bmod p$

3. Compute $S_q = m^{d_q} \bmod q$

4. Set $t = S_p - S_q$

5. If $t < 0$ then $t \leftarrow t + p$

6. Compute $S = S_q + ((t \cdot u) \bmod p) \cdot q$

7. Return $S$ as a signature for the message $M$



Fig. 1. The RSA-CRT signature generation with Garner’s algorithm 



In the next section, we briefly describe the RSA-CRT signature scheme
implemented with Garner's algorithm. Then, we develop a new technique to
factor $N$ when having access to a set of special form integers modulo $N$. In
Section 4, we precisely describe the attack and how to collect such special form
RSA signatures. We also give practical results on experiments. Finally, we propose
classical countermeasures to thwart this attack.

2 RSA Signature Scheme 

Let $N = pq$ be an $n$-bit RSA modulus. The public key of the signer is denoted
by $(N, e)$ and the private key by $(p, q, d)$, where $e$ and $d$ are such that $e \cdot d =
1 \bmod (p - 1)(q - 1)$. Let $M$ be a message to sign. A signature for $M$ is $S =
m^d \bmod N$, where $m$ is deduced from $M$ by an encoding scheme, randomized or
not, such as PKCS#1 for example [7]. To check whether a signature $S$ is valid for $M$, a
verifier simply computes $m$ and checks whether the equality $m = S^e \bmod N$ holds. Note
that the encoding step is mainly used in practice to avoid some basic attacks on
RSA. The requirement on the encoding scheme is that the outputs are uniformly
distributed in $\mathbb{Z}_N$.

Smart card implementations of RSA frequently use the Chinese Remainder
Theorem to speed up the computation of $S = m^d \bmod N$. Garner's algorithm
is an efficient method to determine the signature $S$ from $S_p = S \bmod p$
and $S_q = S \bmod q$. This algorithm does not require any reduction modulo $N$
but uses instead reductions modulo the factors $p$ and $q$. It is thus more efficient
than the classical implementation of the CRT. A detailed description of this
algorithm can be found in [9] and in [4]. In Fig. 1, we describe the RSA signature
generation using Garner's algorithm.

Step 5 of this algorithm needs some explanation: we first remark that the
value $t$ computed at the previous step may be negative, since, as we assume
$q < p$, it lies in the range $]-q, p[$. However, the modular multiplication $tu \bmod p$
has to be performed in the next step. Since inputs for modular multiplications
have to be already reduced in the range $[0, p - 1]$, if $t$ is negative, $p$ should be
added so that $t \ge 0$. Therefore step 5 consists in computing $t \bmod p$ before the
modular multiplication with $u$.
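
The conditional addition of step 5 can be modelled as follows. This Python sketch skips the encoding step and simply returns, next to the signature, a flag telling whether $p$ was added; this flag plays the role of the information assumed to be visible by SPA. The toy parameters and names are hypothetical.

```python
def garner_sign(m, p, q, dp, dq, u):
    """Steps 2-6 of Fig. 1; also reports whether step 5 added p."""
    sp = pow(m, dp, p)
    sq = pow(m, dq, q)
    t = sp - sq
    added_p = t < 0                    # the event assumed to leak through SPA
    if added_p:
        t += p
    s = sq + ((t * u) % p) * q
    return s, added_p


p, q, e, d = 61, 53, 17, 2753          # toy parameters with p > q
dp, dq, u = d % (p - 1), d % (q - 1), pow(q, p - 2, p)

for m in (2, 3, 1000, 3000):
    s, leaked = garner_sign(m, p, q, dp, dq, u)
    assert pow(s, e, p * q) == m       # s is a valid signature of m
    assert leaked == (s % p < s % q)   # the leak compares S mod p with S mod q
```

The second assertion states the key observation: the addition is performed exactly when $S \bmod p < S \bmod q$, a property of the public signature $S$ itself.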

In [10], Novak has described a method to factor an RSA modulus when
Garner's algorithm is used for CRT. This attack applies to the RSA encryption
schemes. It is based on the observation that, for a message $m$ encrypted into
$c$, if $m_p = c^{d_p} \bmod p$ is smaller than $m_q = c^{d_q} \bmod q$, step 5 of the decryption
algorithm (similar to step 5 of the signature generation in Fig. 1) is performed. Otherwise,
no addition is made in this step. The analysis of the power trace gives the
information on the execution of such a conditional step. In other words, using
SPA, an adversary is able to detect whether the addition $t \leftarrow t + p$
is performed, and then to deduce whether $m_p < m_q$. This information allows a binary
search to recover the factor $p$: an attacker searches for a plaintext $m$ such that
$m \bmod p < m \bmod q$ and $(m - 1) \bmod p > (m - 1) \bmod q$. Such a plaintext can
be efficiently found with a binary search combined with a simple power analysis.
Once $m$ is found, Novak has remarked that $m$ is in fact a multiple of the factor
$p$, which can then be deduced as the GCD of $m$ and the modulus $N$. However, this
attack is only possible in a chosen-plaintext scenario. Thus it cannot be mounted
against practical implementations of the RSA signature scheme due to the encoding
step. Indeed, an adversary is still supposed to choose the message $M$ to sign
but does not have enough control over the encoding $m$ of $M$, particularly when
randomization techniques are used.
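
The binary search can be simulated with an idealized oracle. In the sketch below the SPA observation is replaced by a Python function returning whether $m \bmod p < m \bmod q$ for a chosen plaintext $m$; this oracle, the toy primes and the function names are assumptions made for illustration only.

```python
import math
import random


def recover_factor(N, oracle, rng):
    """Binary search for a plaintext m that is a multiple of p, then take a gcd."""
    lo = 0                              # oracle(0) is False: 0 < 0 does not hold
    hi = rng.randrange(1, N)
    while not oracle(hi):               # sample until the addition is observed
        hi = rng.randrange(1, N)
    while hi - lo > 1:                  # invariant: oracle(lo) False, oracle(hi) True
        mid = (lo + hi) // 2
        if oracle(mid):
            hi = mid
        else:
            lo = mid
    # The answer flips from False to True only when m mod p wraps around to 0.
    return math.gcd(hi, N)


p, q = 61, 53


def spa_oracle(m):
    return m % p < m % q                # stands in for the observed power trace


print(recover_factor(p * q, spa_oracle, random.Random(0)))  # prints 61
```

The number of oracle queries in the second loop is about $\log_2 N$, which illustrates why the chosen-plaintext setting is so favourable to the attacker.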

In the following we show how to recover the factor q using this leaked infor- 
mation even if a padding scheme is used. However, we require the prime factors 
of the modulus to be slightly unbalanced. 

3 Lattice Based Techniques 

3.1 Preliminaries on Lattices 

We denote by $\|x\|$ the Euclidean norm of the vector $x = (x_1, \ldots, x_{d+1})$, defined
by $\|x\| = \sqrt{\sum_{i=1}^{d+1} x_i^2}$. Let $v_1, \ldots, v_d$ be $d$ linearly independent vectors such that
for $1 \le i \le d$, $v_i \in \mathbb{Z}^{d+1}$. We denote by $L$ the lattice spanned by the matrix
$V$ whose rows are $v_1, \ldots, v_d$. $L$ is the set of all integer linear combinations of
$v_1, \ldots, v_d$:

$$L = \left\{ \sum_{i=1}^{d} c_i v_i ,\; c_i \in \mathbb{Z} \right\} .$$

Geometrically, $\det(L)$ is the volume of the parallelepiped spanned by $v_1, \ldots, v_d$.
Hadamard's inequality says that $\det(L) \le \|v_1\| \times \cdots \times \|v_d\|$.

Given $(v_1, \ldots, v_d)$, the LLL algorithm [8] will produce a so-called "reduced"
basis $(b_1, \ldots, b_d)$ of $L$ such that

$$\|b_1\| \le 2^{(d-1)/2} \det(L)^{1/d} \qquad (1)$$
in time $O(d^4 \log(M))$ where $M = \max_{1 \le i \le d} \|v_i\|$. Consequently, given a basis of
a lattice, the LLL algorithm finds a short vector $b_1$ of $L$ satisfying equation (1).
Moreover, we assume in the following that the new basis vectors are of the same
length and also have all their coordinates of approximately the same length.
Indeed, a basis for a random lattice can be reduced into an almost orthonormal
basis. Therefore, $\|b_i\| \approx \|b_1\|$ for $1 \le i \le d$, and so $\|b_1\|^d \approx \det(L)$.



3.2 Factoring Using LLL 

In the following we describe a new method, based on lattice reduction, to factor
a modulus $N$ given some special form integers $s_i$ in $\mathbb{Z}_N$.

Let $N = p \times q$ be an RSA modulus such that $p$ and $q$ are two prime integers.
Let $s_1, s_2, \ldots, s_d$ be $d$ integers from $\mathbb{Z}_N$. For each $s_i$, we consider its Euclidean
division by $p$: for all $i \in [1, d]$, we can write $s_i = r_i + u_i \times p$ with $r_i \in \mathbb{Z}_p$ and $u_i \in \mathbb{Z}_q$.
Let us assume that, instead of being distributed all over the set $\mathbb{Z}_p$, the $r_i$ values
are smaller than a bound $A < p/2$. We further consider the lattice $L$ spanned
by the $d + 1$ rows of the following matrix:

$$\begin{pmatrix}
N & 0 & \cdots & 0 & 0 \\
0 & N & \ddots & \vdots & \vdots \\
\vdots & \ddots & \ddots & 0 & \vdots \\
0 & \cdots & 0 & N & 0 \\
-s_1 & -s_2 & \cdots & -s_d & A
\end{pmatrix}$$



Theorem 1. Assuming the LLL algorithm returns the shortest vector of a lattice,
the reduction of lattice $L$ computes the factorization of modulus $N$ with
probability $> 1 - \varepsilon_0$ if the bit-length difference between $p$ and $A$ is such that

$$\log p - \log A > \max\left( \frac{\log q - \log \varepsilon_0 - 0.105}{d} + 2.047,\ \log\left(2\sqrt{d+1}\right) \right)$$

As an example, for a 512-bit prime factor $q$ and a probability of success of
the algorithm $> 1 - 2^{-10}$, we obtain a minimum $\log p - \log A \approx 10.4$ for $d = 60$.
Note that this minimum does not strongly depend on the size of $q$ and is about
5 bits for any cryptographic size of this factor.

Sketch of proof. [A complete proof is given in appendix A.]

By definition of the lattice $L$, a vector of $L$ is an integer combination of the rows
of the matrix. In other words, we may define the lattice in the following way:

$$L = \{ (c_1 N - c s_1, c_2 N - c s_2, \ldots, c_d N - c s_d, A c) \ ;\ (c_1, c_2, \ldots, c_d, c) \in \mathbb{Z}^{d+1} \}$$

For a fixed choice of the integer coefficients $(c_1, c_2, \ldots, c_d, c)$, we write

$$b(c_1, c_2, \ldots, c_d, c) = (c_1 N - c s_1, c_2 N - c s_2, \ldots, c_d N - c s_d, A c) \in L$$

In other words, the lattice $L$ is the set of all the vectors $b(c_1, c_2, \ldots, c_d, c)$ for
$(c_1, c_2, \ldots, c_d, c) \in \mathbb{Z}^{d+1}$.

A special vector of the lattice, strongly related to the prime factor $q$ of $N$,
is “abnormally” short. Consequently, we can expect the LLL lattice
reduction algorithm to compute this short vector.

This special vector is $b^* = b(u_1, u_2, \ldots, u_d, q)$, where $u_i$ is defined by the division
of $s_i$ by $p$, and $q$ is the other factor of the modulus $N = p \times q$. Note that
the knowledge of $b^*$ immediately reveals $q$ since its last coordinate is $Aq$ and $A$
is known. The size of $b^*$, i.e. its Euclidean norm, can be easily estimated and we
obtain $\|b^*\| < \sqrt{d+1}\,Aq$.

Then, in order to prove that the vector $b^*$ is the shortest one, we study the
Euclidean norm of the vectors $b(c_1, c_2, \ldots, c_d, c)$, and we prove that, if $c \ne q$,
those vectors are larger than $b^*$, whatever the $c_i$ may be, for all but a very
small fraction of the $s_i$. A precise analysis, described in appendix A, leads to the
result of theorem 1.

□

Therefore the knowledge of $d$ values $s$ in $\mathbb{Z}_N$ such that $|s \bmod p| < A$ allows
us to factor $N$. In the following, we describe how SPA can be used to find such
$d$ values in the context of RSA signature generation, with a “slightly” unbalanced
modulus.
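As a small illustration of the construction above, the following Python sketch builds the basis matrix of section 3.2 for toy parameters and checks that the integer combination with coefficients $(u_1, \ldots, u_d, q)$ yields the short vector $b^*$. The parameter values and the helper names `build_basis` and `combine` are assumptions chosen for the example; no lattice reduction is performed here, the sketch only verifies the shape and norm of $b^*$.

```python
import random

def build_basis(N, s, A):
    """Rows of the (d+1) x (d+1) matrix of section 3.2."""
    d = len(s)
    rows = [[N if j == i else 0 for j in range(d + 1)] for i in range(d)]
    rows.append([-si for si in s] + [A])
    return rows

def combine(rows, coeffs):
    """Integer linear combination sum_k coeffs[k] * rows[k]."""
    width = len(rows[0])
    return [sum(c * row[j] for c, row in zip(coeffs, rows)) for j in range(width)]

# Toy parameters, assumed for illustration only (A < p/2).
p, q, A, d = 1009, 23, 32, 6
N = p * q
rng = random.Random(1)
r = [rng.randrange(A) for _ in range(d)]   # small residues s_i mod p
u = [rng.randrange(q) for _ in range(d)]   # quotients of s_i by p
s = [ri + ui * p for ri, ui in zip(r, u)]

basis = build_basis(N, s, A)
b_star = combine(basis, u + [q])           # b* = b(u_1, ..., u_d, q)
norm_sq = sum(x * x for x in b_star)
recovered_q = b_star[-1] // A              # last coordinate is A * q

print(b_star, recovered_q)
```

Each of the first $d$ coordinates collapses to $-q \times r_i$, which is why the vector is short when the $r_i$ are small.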



4 Application to RSA-CRT Signature Scheme 

In this section, we use the results presented above to extend the chosen ciphertext
attack described by Novak in [10]. In particular, we show that if the factors $p$
and $q$ of the modulus are such that $|p| - |q| \ge \ell$, for a given bound $\ell$, then they
can be recovered with a known message attack combined with a simple power
analysis. In the following we suppose that an SPA attack allows us to detect
whether the addition of step 5 is performed during a signature generation. Such an
assumption is realistic in practice if no countermeasure is implemented, and a
detailed way to extract this information can be found in [10] and in [5].

Attack. In the previous section we have shown how the prime factors $p$ and $q$ of
a modulus $N$ can be recovered, given a set of integers $s$ in $\mathbb{Z}_N$ such that $s \bmod p$
is less than a given bound. We apply this result by using a simple power analysis
on the RSA signature scheme in order to find these integers.

We assume that the prime factors $p$ and $q$ are such that $|p| - |q| \ge \ell$, for $\ell$
a small integer. Such an assumption is realistic in many actual implementations
since many descriptions of the RSA algorithm only state that $p$ and $q$ have
to be of “roughly” the same length, or “about the same bit-length”. Here, we
consider that $p$ and $q$ have a very small bit-length difference, about 10 bits for
a 1024-bit modulus. This does not contradict the usual
description of RSA, which can be interpreted in many different ways.

Let $S$ be a signature for a random message $M$, computed using the algorithm
described in figure 1. We suppose that step 5 of the algorithm has been
performed to generate the signature $S$. Otherwise, we choose another random
message until this step is executed. Since the optional addition has been made,
we know that $s_p - s_q < 0$. Thus, since we assume that $q < p$, we simply have
$s_p < s_q < q < p$. By definition of $s_p = S \bmod p$, $S$ can always be written as
$S = s_p + u \times p$ for an integer $u < q$. Consequently $S$ is a candidate input to the
factoring algorithm given in section 3.2, for the upper bound $A = q$ on $s \bmod p$.
The problem here is that this bound is clearly not known to the attacker. Thus,
the last entry $A$ of the matrix cannot be given explicitly. However, $A$ only needs
to be an upper bound on the $s_i \bmod p$, where $s_i$ is the $i$th input of the last row. Thus
any integer $A \ge q$ is a correct choice, and in practice we choose $A$ to be
the largest integer such that $|A| = |q|$.

To run the lattice reduction described in the previous section, we have to
find $d$ signatures $S_i$ such that $S_i \bmod p$ is less than the bound $A$ we choose.
We thus query the signature of messages and we perform an SPA attack on each
generation. Each signature verifying $s_p < s_q$ is kept as input to the matrix. We
query the signing card until $d$ valid candidate signatures have been found.
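The collection step can be simulated with a toy RSA-CRT signer. The sketch below assumes that figure 1 has the usual Garner shape (half exponentiations $s_p$ and $s_q$, difference $t = s_p - s_q$, conditional addition of $p$, recombination with $q^{-1} \bmod p$); the function name `sign_crt`, the returned flag standing in for the SPA observation, and the tiny parameters are all illustrative assumptions. It requires Python 3.8 or later for `pow(x, -1, n)`.

```python
def sign_crt(m, p, q, d):
    """Toy RSA-CRT signature with Garner recombination (assumed shape of figure 1).
    Returns (S, addition_done); addition_done models what SPA is assumed to leak."""
    s_p = pow(m, d % (p - 1), p)
    s_q = pow(m, d % (q - 1), q)
    t = s_p - s_q
    addition_done = t < 0            # the conditional addition of step 5
    if addition_done:
        t += p
    h = (t * pow(q, -1, p)) % p
    return s_q + h * q, addition_done

# Toy unbalanced modulus, assumed for illustration only.
p, q, e = 1009, 23, 5
N = p * q
d = pow(e, -1, (p - 1) * (q - 1))

candidates = []
for m in range(N):                   # exhaustive only because N is tiny
    S, addition_done = sign_crt(m, p, q, d)
    assert pow(S, e, N) == m         # the signature verifies
    if addition_done:
        candidates.append(S)         # kept as input to the lattice

print(len(candidates), all(S % p < q for S in candidates))
```

Every retained signature satisfies $S \bmod p < q$, which is exactly the property required of the inputs of section 3.2.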

The number $d$ of required signatures has to be sufficiently large, according
to the bit-length difference $\ell$ between the factors $p$ and $q$, and according to
the modulus length. To estimate the average number of queries made to the
signing card, we compute the probability that $s_p = m^{d_p} \bmod p$ is less than
$s_q = m^{d_q} \bmod q$, for $q < p/2^{\ell}$ and for random integers $s \in \mathbb{Z}_N$. We suppose
that the values $s$ are uniformly and independently distributed in $\mathbb{Z}_N$. This is
verified in practice when an appropriate hash function, such as SHA-1, is used.
In this case we assume that the output $m$ of the encoding scheme is uniformly
distributed in $\mathbb{Z}_N$ so that $s = m^d \bmod N$ is uniformly distributed. Let $s_p$ and $s_q$ be
the values computed during the signature generation in steps 2 and 3 respectively.
We have:

$$\Pr\{s_p < s_q\} = \sum_{B=0}^{q-1} \Pr\{s_p < B \mid s_q = B\} \cdot \Pr\{s_q = B\} = \sum_{B=0}^{q-1} \frac{B}{p} \cdot \frac{1}{q} = \frac{q(q-1)}{2pq} < \frac{1}{2^{\ell+1}}$$

where the last inequality comes from the fact that $q < p/2^{\ell}$.
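The closed form above can be checked by brute force on a toy modulus. The following sketch enumerates all $s \in \mathbb{Z}_N$ and compares the exact frequency of $s \bmod p < s \bmod q$ with $(q-1)/(2p)$; the function name and the parameter values are assumptions made for the illustration.

```python
from fractions import Fraction

def prob_sp_less_sq(p, q):
    """Exact Pr{s mod p < s mod q} for s uniform in Z_N, N = p*q, by enumeration."""
    N = p * q
    hits = sum(1 for s in range(N) if s % p < s % q)
    return Fraction(hits, N)

# Toy parameters (assumed): q < p / 2^ell with ell = 5.
p, q, ell = 1009, 23, 5
prob = prob_sp_less_sq(p, q)
closed_form = Fraction(q * (q - 1), 2 * p * q)

print(prob, closed_form, prob < Fraction(1, 2 ** (ell + 1)))
```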

Thus in a set of $2^{\ell}$ signatures, there is a probability of one half that at least
one of them is such that $s_p < s_q$. Detecting such a signature is possible with
an SPA attack during the signature generation: when step 5 is performed,
we know that the signature $S_i$ is a good candidate input for our algorithm, if
we write it as $S_i = (s_i \bmod p) + u_i p$ where $s_i \bmod p < p/2^{\ell}$. Otherwise, we
query another signature until we find a good candidate. On average, $2^{\ell}$ trials
are needed. To have $d$ such signatures, a set of $d \cdot 2^{\ell}$ signatures is required.
The algorithm described in section 3.2 to factor a modulus $N$ can then be used
with the candidate signatures as inputs. Following the analysis made above, the
number $d$ of required signatures must be such that $\ell \ge \frac{\log q + 10}{d} + 2$ so that the
algorithm successfully ends with probability greater than $1 - 2^{-10}$. However, we
assume here that $q < p/2^{\ell}$, and thus $\log q = (\log N - \ell)/2$. Thus, after some simple
computations we obtain:

$$\ell \ge \frac{\log N + 20 + 4d}{2d + 1}$$

Thus, if $p$ and $q$ are such that $q < p/2^{\ell}$ for $\ell$ greater than $\frac{\log N + 20 + 4d}{2d+1}$, we can
factor $N$ from $d \times 2^{\ell}$ signatures on average and with probability at least $1 - 2^{-10}$.

modulus length   |p| - |q|   lattice dimension   average number of      time to factor
in bits                      d + 1               required signatures

512              8           41                  2^13                   30 s
                 7           51                  2^13                   2 min
                 6           61                  [illegible]            14 min
768              10          43                  [illegible]            50 min
                 9           51                  [illegible]            2 min
                 8           61                  [illegible]            15 min
1024             12          46                  [illegible]            2 min
                 11          56                  [illegible]            6 min
                 10          61                  [illegible]            16 min
1536             16          53                  2^22                   7 min
                 15          56                  2^21                   10 min
                 14          61                  2^20                   32 min
2048             20          53                  2^26                   11 min
                 19          56                  2^25                   20 min
                 18          61                  [illegible]            32 min

Fig. 2. Experimental results on the RSA-CRT with unbalanced modulus.

Experimental results. In practice, the lattice dimension depends on the value $\ell$,
and on the number of available signatures. However, if the lattice dimension is
too large, then the LLL algorithm fails. In particular, it is not possible to run
the LLL algorithm in a reasonable time on a $100 \times 100$ matrix
whose entries are 1024-bit numbers.

For each modulus length, we give in figure 2 the integer $\ell$ such that $q < p/2^{\ell}$,
the dimension $d + 1$ of the lattice, the average number of required signatures and
the time needed to recover $p$ and $q$. The number of required signatures, equal
to $d \times 2^{\ell}$, is upper-bounded by $2^{\ell+6}$ since we always have $d < 2^6$. The
tests have been run on an Intel Pentium IV, XEON 1.5 GHz, with Victor
Shoup’s NTL library [12].

The table shows that the LLL algorithm works better
than the theoretical results of section 3.2 indicate. For a 1024-bit modulus, the
expected result gives $\ell \ge 10$. These values are realistic since a difference of 10
bits between $p$ and $q$ for a 1024-bit modulus means that $p$ is a 517-bit prime
and $q$ is a 507-bit prime number. This distance between $p$ and $q$ is not ruled
out by the specifications of the key generation of the RSA algorithm. It is worth
noticing that this attack can be extended to more “secure” moduli, say 2048
bits long. Moreover, this attack can be mounted in practice since the number of
required signatures for known messages is not large: about one million if $n = 1536$.

5 Countermeasures 

Such an attack shows that an implementation of Garner’s algorithm should
be carefully checked so that SPA or other side channel attacks are not
possible. We now propose some countermeasures to avoid this attack.

In order to balance execution time and power consumption, dummy operations
can be added. This can be done by modifying step 4 of the algorithm as
follows: first, $t$ is computed as $s_p - s_q$. Then, new variables $t'$ and $t''$ are
respectively set to $t + p$ and $t$. Step 5 is then defined as: 5. If $t < 0$ then $t \leftarrow t'$,
else $t \leftarrow t''$. In this case, the implementation does not leak any information
about the difference $s_p - s_q$ since the addition is always performed. The crucial
remark here is that this implementation should use a probabilistic encoding step
(step 1 of figure 1). If not, another attack is possible: suppose for example that
PKCS#1 v1.5 is used, so that the encoding of a message is always the same. In
this case, this countermeasure can be broken by using safe-error attacks [13].
Such attacks inject a fault at a particular computational step, during an
operation whose result may or may not be used depending on some secret data. Here a
fault can be injected during the computation of $t'$. If the resulting signature
is not valid (the card outputs an error), we learn that for this signature $t'$
has been used, that is $t < 0$. In this case, we query the card again with the same
message without producing any error during the generation. We know that the
resulting signature is such that $t < 0$, since it has been computed on the same
input $m$. We can then use this signature as input for our algorithm.
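The modified steps 4 and 5 can be sketched as follows. This is only a functional illustration in Python, with an assumed function name; Python integers are not constant-time, so the sketch shows the data flow of the dummy addition, not a side-channel-hardened implementation, and it remains subject to the safe-error caveat discussed above.

```python
def balanced_difference(s_p, s_q, p):
    """Steps 4 and 5 with a dummy addition: t + p is always computed."""
    t = s_p - s_q
    t_prime = t + p          # always performed, whatever the sign of t
    t_second = t             # copy used when no correction is needed
    return t_prime if t < 0 else t_second

# Toy check (assumed parameters): the result equals (s_p - s_q) mod p.
p, q = 1009, 23
results_ok = all(
    balanced_difference(s_p, s_q, p) == (s_p - s_q) % p
    for s_p in range(0, p, 37)
    for s_q in range(q)
)
print(results_ok)
```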

Another classical countermeasure [2] is based on the randomization of $m$: a
signature $s$ is computed on $r^e \times m$. The signature for $m$ is given as $s/r \bmod N$. In
this case, since $r$ is kept secret, there is no relationship between the information
leaked by the card on the value $t$ and the output signature $s/r \bmod N$. Thus, in
this case, our attack is no longer feasible. Note that it is also possible to fully
randomize all the parameters of the signature generation, as done by many actual
implementations of RSA signatures: the factors $p$ and $q$ are randomized as
$p' = r_1 \times p$ and $q' = r_2 \times q$ and the signature $s$ is deduced from $s \bmod p'$
and $s \bmod q'$, respectively computed from $(m \bmod p) + r'_1 \times p$ and
$(m \bmod q) + r'_2 \times q$. The signature is finally given by $s \bmod N$. Due
to a full randomization of each step of the signature generation, such an
implementation is not subject to our attack.
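The first of these countermeasures, message blinding, can be illustrated with a toy RSA key. In the sketch below the function name `blinded_sign`, the parameters, and the use of a plain (non-CRT) exponentiation for the inner signature are assumptions made to keep the example short; a real card would run its CRT routine on the blinded value. Python 3.8 or later is needed for the modular inverse via `pow`.

```python
import random
from math import gcd

def blinded_sign(m, d, e, N, rng):
    """Sign r^e * m, then divide by r modulo N (blinding in the style of [2])."""
    r = rng.randrange(2, N)
    while gcd(r, N) != 1:                 # r must be invertible modulo N
        r = rng.randrange(2, N)
    blinded = (pow(r, e, N) * m) % N
    s_blinded = pow(blinded, d, N)        # equals r * m^d mod N
    return (s_blinded * pow(r, -1, N)) % N

# Toy key, assumed for illustration only.
p, q, e = 1009, 23, 5
N = p * q
d = pow(e, -1, (p - 1) * (q - 1))
rng = random.Random(7)

m = 4242
s = blinded_sign(m, d, e, N, rng)
print(s == pow(m, d, N))
```

The value processed by the exponentiation changes with every fresh $r$, so the leaked sign of $t$ is unrelated to the signature that is finally output.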

Acknowledgments. We wish to thank anonymous referees for their construc- 
tive remarks and suggestions of countermeasures. 
A Proof of Theorem 1

By definition of the lattice $L$, a vector of $L$ is an integer combination of the rows
of the matrix. In other words, we may define the lattice in the following way:

$$L = \{ (c_1 N - c s_1, c_2 N - c s_2, \ldots, c_d N - c s_d, A c) \ ;\ (c_1, c_2, \ldots, c_d, c) \in \mathbb{Z}^{d+1} \}$$

For a fixed choice of the integer coefficients $(c_1, c_2, \ldots, c_d, c)$, we write

$$b(c_1, c_2, \ldots, c_d, c) = (c_1 N - c s_1, c_2 N - c s_2, \ldots, c_d N - c s_d, A c) \in L$$

In other words, the lattice $L$ is the set of all the vectors $b(c_1, c_2, \ldots, c_d, c)$ for
$(c_1, c_2, \ldots, c_d, c) \in \mathbb{Z}^{d+1}$.

We now show that a special vector of the lattice, strongly related to the
prime factor $q$ of $N$, is “abnormally” short. Consequently, we can expect
the LLL lattice reduction algorithm to compute this short vector.

This special vector is $b^* = b(u_1, u_2, \ldots, u_d, q)$, where $u_i$ is defined by the division
of $s_i$ by $p$, and $q$ is the other factor of the modulus $N = p \times q$. Note that
the knowledge of $b^*$ immediately reveals $q$ since its last coordinate is $Aq$ and $A$
is known. Let us evaluate the size of $b^*$, i.e. its Euclidean norm:

$$\begin{aligned}
\|b^*\|^2 &= \|b(u_1, u_2, \ldots, u_d, q)\|^2 \\
&= \|(u_1 N - q s_1, u_2 N - q s_2, \ldots, u_d N - q s_d, A q)\|^2 \\
&= \sum_{i=1}^{d} \left(u_i N - q (r_i + u_i p)\right)^2 + A^2 q^2 \\
&= \sum_{i=1}^{d} q^2 r_i^2 + A^2 q^2
\end{aligned}$$

Since $0 \le r_i < A$, we immediately obtain that $\|b^*\|^2 < d q^2 A^2 + A^2 q^2$, and so
$\|b^*\| < \sqrt{d+1}\,A q$.

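In order to prove that the vector $b^*$ is abnormally short, we now show that
the other elements of the lattice $L$ are, with overwhelming probability, larger.
With this aim in view, we define, for any integer $c$, the function

$$\mathcal{F}(c) = \min_{(c_1, c_2, \ldots, c_d) \in \mathbb{Z}^d} \|b(c_1, c_2, \ldots, c_d, c)\|$$

i.e., $\mathcal{F}(c)$ is the size of the shortest vector in $L$ whose last coordinate is equal to
$Ac$. From the definition of $\mathcal{F}(c)$, we can derive the following expression

$$\mathcal{F}(c) = \min_{(c_1, c_2, \ldots, c_d) \in \mathbb{Z}^d} \sqrt{\sum_{i=1}^{d} (c_i N - c s_i)^2 + A^2 c^2}$$

Then, it is easy, for a fixed $c$, to find the $c_i$ that reach the minimum since $\mathcal{F}(c)$ is
the minimum of a sum of independent squares. As a consequence, the minimum
is reached when each term is as small as possible. This means that $c_i$ is the
nearest integer to $c s_i / N$; we write $\lfloor c s_i / N \rceil$ for this integer. We finally obtain

$$\mathcal{F}(c) = \sqrt{\sum_{i=1}^{d} \left( \left\lfloor \frac{c s_i}{N} \right\rceil N - c s_i \right)^2 + A^2 c^2}$$

We can now notice that, by definition of $\mathcal{F}(c)$ and $b^* = b(u_1, u_2, \ldots, u_d, q)$, we
have $\mathcal{F}(q) \le \|b^*\| < \sqrt{d+1}\,A q$. In fact, $b^*$ is exactly the smallest vector with
last coordinate equal to $Aq$ because

$$\left\lfloor \frac{q s_i}{N} \right\rceil = \left\lfloor \frac{q (r_i + u_i p)}{N} \right\rceil = \left\lfloor \frac{r_i}{p} + u_i \right\rceil = u_i + \left\lfloor \frac{r_i}{p} \right\rceil = u_i$$

since, for $0 \le r_i < A < \frac{p}{2}$, $\left\lfloor \frac{r_i}{p} \right\rceil = 0$. This finally proves that $\mathcal{F}(q) = \|b^*\|$.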
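The function $\mathcal{F}$ is easy to evaluate numerically, since each $c_i$ is chosen independently as a nearest integer. The sketch below computes $\mathcal{F}(c)^2$ with exact integer arithmetic and checks the identity $\mathcal{F}(q) = \|b^*\|$ on toy parameters; the function names and the parameter values are assumptions made for the illustration.

```python
import random

def nearest_int_div(a, n):
    """Nearest integer to a / n for integers a and n > 0 (exact arithmetic)."""
    return (2 * a + n) // (2 * n)

def F_squared(c, s, N, A):
    """Squared norm of the shortest lattice vector whose last coordinate is A*c."""
    total = (A * c) ** 2
    for si in s:
        ci = nearest_int_div(c * si, N)
        total += (ci * N - c * si) ** 2
    return total

# Toy parameters, assumed for illustration only (A < p/2).
p, q, A, d = 1009, 23, 32, 6
N = p * q
rng = random.Random(1)
r = [rng.randrange(A) for _ in range(d)]
u = [rng.randrange(q) for _ in range(d)]
s = [ri + ui * p for ri, ui in zip(r, u)]

b_star_sq = q * q * sum(ri * ri for ri in r) + (A * q) ** 2
print(F_squared(q, s, N, A) == b_star_sq)
```

Evaluating `F_squared(c, s, N, A)` for other values of `c` gives an experimental view of how rarely a competing vector is shorter than $b^*$.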

Then, we can further notice that if $c \ge \sqrt{d+1} \times q$, $\mathcal{F}(c)$ is obviously greater
than $Ac$, so $\mathcal{F}(c) \ge \sqrt{d+1} \times q \times A > \mathcal{F}(q)$. We finally need to evaluate, for a fixed
$c \ne q$, the probability that $\mathcal{F}(c) < \mathcal{F}(q)$ where the probabilities are computed
when the $s_i$ are uniformly distributed in

$$\mathcal{S} = \{ r + u \times p \ ;\ 0 \le r < A,\ 0 \le u < q \} \subset \mathbb{Z}_N$$

If this probability is negligible, we can conclude that $b^*$ is, with overwhelming
probability, the shortest vector of the lattice.

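We first notice that the distribution of the $\lfloor c s_i / N \rceil \times N - c \times s_i$, when $s_i$ is
uniformly distributed in $\mathbb{Z}_N$, is uniform between $-\frac{N-1}{2}$ and $\frac{N-1}{2}$. This is an
obvious consequence of the fact that $\lfloor c s_i / N \rceil - c s_i / N \in [-1/2; 1/2]$ and that if
$\lfloor c s_i / N \rceil \times N - c \times s_i = \alpha$ for an integer $\alpha$, we obtain by modular reduction that
$-c \times s_i = \alpha \bmod N$ and thus that $s_i = -\alpha \times c^{-1} \bmod N$ if $c$ is prime with $N$.

If we restrict the possible values of $s_i$ to the set $\mathcal{S}$, the previous result implies

$$\mathcal{D}_c = \left\{ \left\lfloor \frac{c s_i}{N} \right\rceil \times N - c s_i \ ;\ s_i \in \mathcal{S} \right\} \subset \left[ \frac{-(N-1)}{2}, \frac{N-1}{2} \right]$$

We can further write

$$\begin{aligned}
\mathcal{D}_c &= \left\{ \left\lfloor \frac{c s_i}{N} \right\rceil \times N - c s_i \ ;\ s_i = r_i + u_i p,\ 0 \le r_i < A,\ 0 \le u_i < q \right\} \\
&= \left\{ \left\lfloor \frac{c r_i}{N} + \frac{c u_i}{q} \right\rceil \times N - c r_i - c u_i p \ ;\ 0 \le r_i < A,\ 0 \le u_i < q \right\} \\
&= \left\{ \left\lfloor \frac{c r_i}{N} + \frac{c u_i \bmod q}{q} \right\rceil \times N - c r_i - (c u_i \bmod q) \times p \ ;\ 0 \le r_i < A,\ u_i \in \mathbb{Z}_q \right\} \\
&= \left\{ \left\lfloor \frac{c r_i}{N} + \frac{v_i}{q} \right\rceil \times N - c r_i - v_i p \ ;\ 0 \le r_i < A,\ 0 \le v_i < q \right\}
\end{aligned}$$

Then, for any $\alpha \in [-(N-1)/2, (N-1)/2]$, if $\left\lfloor \frac{c r_i}{N} + \frac{v_i}{q} \right\rceil \times N - c \times r_i - v_i \times p = \alpha$,
we obtain that $\alpha = -c r_i - v_i p \bmod N$. Since $v_i$ is uniformly distributed in $\mathbb{Z}_q$
and $\alpha + p = -c r_i - (v_i - 1 \bmod q) \times p \bmod N$, we conclude that, if $\alpha$ is an
element of $\mathcal{D}_c$, $\alpha + p$ (or $\alpha + p - N$ if $\alpha + p > N/2$) is also an element of $\mathcal{D}_c$.
So, if $\gcd(c, N) = 1$, the set $\mathcal{D}_c$ is a subset of $\left[ \frac{-(N-1)}{2}, \frac{N-1}{2} \right]$ that is invariant
by (circular) translation of length $p$.

In other words, this formalizes the idea that, if $c$ is prime with $N$, the elements
of $\mathcal{D}_c$ are well distributed in $\left[ \frac{-(N-1)}{2}, \frac{N-1}{2} \right]$. As a toy example, figure 3 represents
the elements of $\mathcal{D}_3 \times \mathcal{D}_3$ for $N = 7 \times 11$ and $A = 5$.

Fig. 3. Graphical representation of the pairs $(s_1, s_2)$ for the parameters $d = 2$, $p = 11$,
$q = 7$, $N = 77$ and $A = 5$. The disk of radius $N\Delta$ covers a ratio of those points that is
illustrated in figure 4.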

Always with the aim of computing the probability for $\mathcal{F}(c)$ to be less than
$\mathcal{F}(q)$, if $\gcd(c, N) = 1$, we can state that

$$\begin{aligned}
\Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \{ \mathcal{F}(c) < \mathcal{F}(q) \}
&\le \Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \left\{ \mathcal{F}(c) < \sqrt{d+1} \times A q \right\} \\
&\le \Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \left\{ \sum_{i=1}^{d} \left( \left\lfloor \frac{c s_i}{N} \right\rceil N - c s_i \right)^2 < (d+1) A^2 q^2 \right\}
\end{aligned}$$

Let $\Delta = \sqrt{d+1}\,A/p$; we need to evaluate the probability

$$\Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \left\{ \sum_{i=1}^{d} \left( \left\lfloor \frac{c s_i}{N} \right\rceil N - c s_i \right)^2 < N^2 \Delta^2 \right\}$$

From the previous result on the distribution of $X_i = \left\lfloor \frac{c s_i}{N} \right\rceil - \frac{c s_i}{N}$ in $\left[ -\frac{1}{2}, \frac{1}{2} \right]$, this
probability can be approximated, for large $N$, by

$$\Pr \left\{ \sum_{i=1}^{d} X_i^2 < \Delta^2 \right\}$$

where the $X_i$ are independent and uniformly distributed over $\left[ -\frac{1}{2}, \frac{1}{2} \right]$. If $0 <
\Delta < \frac{1}{2}$, this probability is equal to the volume of the $d$-dimensional ball of radius
$\Delta$. Figure 4 provides a graphical illustration of this fact, using the toy example
of figure 3. If $d$ is even, this volume is equal to $\pi^{d/2} \Delta^d / (d/2)!$.

Fig. 4. Ratio of points in figure 3 covered by a disk of radius $N\Delta$. The irregular
experimental curve represents the number of points covered by the disk and the smooth
curve is based on the approximation by the surface of this disk.

Note that this approximation of the number of points in the ball of radius
$N\Delta$ (see figure 3) using the volume of this ball is very good for values of $c$ that
are relatively prime with $q$, even if the distribution of the points is not perfectly
uniform. This is mainly due to the compensation of local errors of estimation.
When $c = q$, such a compensation does not apply and the approximation can
no longer be used. Then, using the well-known Stirling formula, we obtain the
upper bound

$$\Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \left\{ \sum_{i=1}^{d} \left( \left\lfloor \frac{c s_i}{N} \right\rceil N - c s_i \right)^2 < N^2 \Delta^2 \right\} < \frac{1}{\sqrt{d\pi}} \times \left( \frac{2 e \pi \Delta^2}{d} \right)^{d/2}$$
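We finally obtain

$$\Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \{ \mathcal{F}(c) < \mathcal{F}(q) \} < \left( \frac{2 e \pi \Delta^2}{d} \right)^{d/2} \times \frac{1}{\sqrt{d\pi}}$$

Note that this is true only if $\Delta < \frac{1}{2}$, i.e., if $A < \frac{p}{2\sqrt{d+1}}$. In other words, in
order to make the proof correct, the difference $-\log(A/p)$ of
bit-length of $p$ and $A$ must fulfill the inequality $-\log(A/p) > \log\left(2\sqrt{d+1}\right)$.

Using the fact that $\left(1 + \frac{1}{d}\right)^d < e$, we finally obtain an upper bound of the
probability for $\mathcal{F}(c)$ to be smaller than $\mathcal{F}(q)$

$$\Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \{ \mathcal{F}(c) < \mathcal{F}(q) \} < \left( \frac{2 e \pi A^2}{p^2} \times \frac{d+1}{d} \right)^{d/2} \times \frac{1}{\sqrt{d\pi}} < \left( \sqrt{2 \pi e}\, \frac{A}{p} \right)^{d} \times \sqrt{\frac{e}{d\pi}}$$

The last step is to estimate for which values of the parameters we can consider
that $\mathcal{F}(c) > \mathcal{F}(q) = \|b^*\|$ for any $c \ne q$. Let $P_0 = 1 - \varepsilon_0$ be a lower bound of
the probability for $b^*$ to be the smallest vector of the lattice. From the previous
results, using an approximate argument of independence of the probabilities
for different values of $c$, we deduce

$$\Pr_{(s_1, s_2, \ldots, s_d) \in \mathcal{S}^d} \{ \forall c \ne q,\ \mathcal{F}(c) > \mathcal{F}(q) \} \ge 1 - \left( \sqrt{d+1} \times q \right) \times \sqrt{\frac{e}{d\pi}} \left( \sqrt{2 \pi e}\, \frac{A}{p} \right)^{d}$$

If $\varepsilon_0 > \left( \sqrt{d+1} \times q \right) \times \sqrt{\frac{e}{d\pi}} \left( \sqrt{2 \pi e}\, \frac{A}{p} \right)^{d}$, the last expression is greater than
$P_0$. Thus, the inequality can be reworded as follows

$$-\log\left( \frac{A}{p} \right) > \frac{1}{d} \left[ \log q - \log \varepsilon_0 + \frac{1}{2} \log\left( \frac{d+1}{d} \times \frac{e}{\pi} \right) \right] + \log \sqrt{2 \pi e}$$

In other words, this means that, if the difference $-\log(A/p)$ of bit-length of $p$ and
$A$ is larger than $\frac{\log q - \log \varepsilon_0 - 0.105}{d} + 2.047$, the algorithm finds $q$ with probability
$> 1 - \varepsilon_0$, assuming that LLL returns the shortest vector of the lattice.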

□ 
Abstract. The recent developments of side channel attacks have led
implementers to use more and more sophisticated countermeasures in
critical operations such as modular exponentiation, or scalar multiplication
in the elliptic curve setting. In this paper, we propose a new attack
against a classical implementation of these operations that only requires
two queries to the device. The complexity of this so-called “doubling
attack” is much smaller than that of previously known ones. Furthermore, this
approach defeats two of the three countermeasures proposed by Coron
at CHES ’99.

Keywords. SPA-based analysis, modular exponentiation, scalar multiplication,
DPA countermeasures, multiple exponent single data attack.



1 Introduction 

Modular exponentiation and scalar multiplication are the main operations of the most
popular public key cryptosystems such as RSA [15] or DSA [13]. These very sensitive
operations can be efficiently implemented in smart card products. However,
data manipulated during this computation should often be kept secret, so the
implementation of such algorithms must be protected against side channel attacks.
For example, during the generation of an RSA signature by a device, the secret
exponent is used to transform message-related data into a digital signature via
modular exponentiation.

Timing and power attacks, initially presented by Kocher [9,10], are now well
studied and various countermeasures have been proposed. Those attacks represent
a real threat when we consider operations that both involve secret data and
require a long computation time. The consequence is that naive implementations
of RSA-based or discrete-log-based cryptosystems usually leak information about
the secret key.

In this paper we present a new side channel attack, which we call the “doubling
attack”, that recovers the secret scalar used in binary scalar
multiplication or the secret exponent used in the binary exponentiation algorithm.
It is worth noticing that, contrary to previous attacks, which work for both the
“Left-to-Right” and the “Right-to-Left” implementations of binary modular
exponentiation, the doubling attack only works for the “Left-to-Right”
implementation. Furthermore, the “Left-to-Right” implementation is often used since
it requires only one variable. The new attack makes it possible to recover the secret
decryption key of RSA [15] or the decryption key of ElGamal [4]. It can also be
used to obtain the secret key of the Diffie-Hellman authentication system. We
only focus on the decryption cases. In this attack we assume that the adversary
mounts a chosen ciphertext attack. This is a valid assumption in a side channel
scenario, since randomized paddings that prevent chosen ciphertext attacks, such as
OAEP [1], are checked after the decryption process has run. As a consequence,
the binary exponentiation or multiplication is always performed and
side channel attacks can be mounted on these algorithms. The attack on the
RSA cryptosystem is a direct application of the doubling attack. On discrete-log
based cryptosystems, we describe the attack in the elliptic curve setting, since
the doubling attack defeats classical countermeasures that were mainly
proposed for elliptic curve systems.

In this paper, we first recall the classical binary scalar multiplication algorithms.
Then, we briefly describe different types of side channel attacks, such
as simple power analysis and differential power analysis, as well as the attack of
Messerges, Dabbish and Sloan [11], in order to motivate the most frequently used
countermeasures.

Then, we present an improvement of the attack of Messerges et al. that applies when
so-called downward algorithms are used. It has a much smaller complexity since
it only requires two queries to the device in order to recover all the secret data.
This new attack is called the “doubling attack” since it is based on the doubling
operation in the elliptic curve setting. We also explain how to use this new
attack to defeat Coron’s countermeasures [3].

2 Binary Scalar Multiplication Algorithms 

In classical cryptosystems based on RSA or on the discrete logarithm problem,
the main operation is modular exponentiation. In the elliptic curve setting,
the corresponding operation is scalar multiplication. From an algorithmic
point of view, those two operations are very similar; the only difference is the
underlying group structure. In this paper, we consider operations over a generic
group, without using any additional property. The consequence is an immediate
application to the elliptic curve setting, but it should be clear that everything we
state can be easily transposed to modular exponentiation.

Scalar multiplication is usually performed using the “double-and-add”
method that computes $d \times P$ using the binary representation of the scalar
$d = \sum_{i=0}^{n} d_i \times 2^i$:

$$d \times P = \sum_{i=0}^{n} d_i \times (2^i \times P)$$

Two versions of the double-and-add algorithm are usually considered, according
to the order of the terms in the previous sum. The first routine starts from the
most significant bit and works downward. This method is usually called
“Left-to-Right” (see figure 1).



    S = 0
    for i from n down to 0
        S = 2.S
        if d_i = 1 then S = S + P
    return S

Fig. 1. Downward “Left-to-Right” double-and-add(P,d)

The second routine starts from the least significant bit and works upward.
This method is also known as “Right-to-Left” (see figure 2).

    S = 0
    T = P
    for i from 0 to n
        if d_i = 1 then S = S + T
        T = 2.T
    return S

Fig. 2. Upward “Right-to-Left” double-and-add(P,d)
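The two routines of figures 1 and 2 can be written over a generic group as follows. This is a minimal sketch: the group law is passed as an assumed `add` callable with neutral element `zero`, and the integers modulo a small number stand in for an elliptic curve or for a multiplicative group.

```python
def left_to_right(P, d, add, zero):
    """Downward double-and-add (figure 1): scans d from the most significant bit."""
    S = zero
    for i in range(d.bit_length() - 1, -1, -1):
        S = add(S, S)                 # doubling
        if (d >> i) & 1:
            S = add(S, P)             # addition
    return S

def right_to_left(P, d, add, zero):
    """Upward double-and-add (figure 2): scans d from the least significant bit."""
    S, T = zero, P
    for i in range(d.bit_length()):
        if (d >> i) & 1:
            S = add(S, T)
        T = add(T, T)
    return S

# Toy groups, assumed for illustration: (Z_M, +) and (Z_M^*, *).
M = 1000003
add_mod = lambda a, b: (a + b) % M
mul_mod = lambda a, b: (a * b) % M

P, d = 123456, 0b1011001110001
print(left_to_right(P, d, add_mod, 0) == right_to_left(P, d, add_mod, 0))
print(left_to_right(P, d, mul_mod, 1) == pow(P, d, M))
```

With the multiplicative law, the same code performs square-and-multiply, which is the sense in which the text treats both operations uniformly.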

The first implementation is the most frequently used since it requires less
memory. Up to now, no distinction was made regarding the security of those routines
since all proposed attacks can be adapted to both implementations. In the
following sections, we focus on the downward implementation and we show that
it may be much more easily attacked than the upward version.

3 Power Analysis Attacks

It is well known that naive double-and-add algorithms are subject to the power
attacks introduced by Kocher et al. [10]. More precisely, they introduced two
types of power attacks, Simple Power Analysis (SPA) and Differential Power
Analysis (DPA), which we now briefly recall.

3.1 Simple Power Analysis 

The first type of attack consists in observing the power consumption in order to
guess which instruction is executed. For example, in the previous algorithm, one
can easily recover the exponent $d$ by distinguishing the doubling
from the addition instruction. To avoid this attack, the downward double-and-add
algorithm is usually modified using so-called “dummy” instructions (see figure 3).

    S[0] = 0
    for i from n down to 0
        S[0] = 2.S[0]
        S[1] = S[0] + P
        S[0] = S[d_i]
    return S[0]

Fig. 3. Downward double-and-add(P,d) resistant against SPA
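A direct transcription of figure 3 is sketched below over an assumed generic group law `add`. The explicit bit-length parameter `n_bits` is an assumption added so that the number of iterations does not depend on the scalar; as with any Python code, this illustrates the control flow only and says nothing about the physical leakage of a real device.

```python
def double_and_add_always(P, d, add, zero, n_bits):
    """Figure 3: one doubling and one addition per bit, whatever the bit value."""
    S = [zero, zero]
    for i in range(n_bits - 1, -1, -1):
        S[0] = add(S[0], S[0])        # doubling
        S[1] = add(S[0], P)           # addition, possibly a dummy one
        S[0] = S[(d >> i) & 1]        # keep the sum only if d_i = 1
    return S[0]

# Toy additive group (Z_M, +), assumed for illustration.
M = 1000003
add_mod = lambda a, b: (a + b) % M

P = 424242
outputs = [double_and_add_always(P, d, add_mod, 0, 16) for d in (0, 1, 2, 0xBEEF)]
print(outputs)
```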



Although this new algorithm is immune to SPA, a more sophisticated treatment
of power consumption measurements can still allow an attacker to recover the secret
scalar $d$.

3.2 Differential Power Analysis

DPA uses power consumption to retrieve information on the operand of the
instruction. More precisely, it no longer focuses on which instruction is executed
but on the Hamming weight of the operands used by the instruction. Such an
attack has been described in the elliptic curve setting in [3,14].

This technique can also be used in a different way. Messerges, Dabbish and
Sloan introduced the “Multiple Exponent Single Data” attack [11]. Note that, for
our purpose, a better name would be “Multiple Scalar Single Data”. We first
assume that we have two identical devices available with the same
implementation of the algorithm of figure 3, one with an unknown scalar $d$ and another one with
a chosen scalar $e$. In order to discover the value of $d$, using the correlation between
power consumption and operand value, we can apply the following algorithm.
We guess the bit of $d$ which is first used in the double-and-add algorithm and
we set $e_n$ to this guessed value. Then, we compare the power consumption of
the two devices doing the scalar multiplication of the same message. If the
consumption is similar during the first two steps of the inner loop, it means that
we have guessed the correct bit. Otherwise, if the consumption differs in the
second step, it means that the values are different and that we have guessed the
wrong bit. So, after this measurement, we know the most significant bit of $d$. Then,
we can improve our knowledge of $d$ by iterating this attack to find all bits, as
illustrated in the algorithm of figure 4.

This kind of attack is well known and some classical countermeasures are
often implemented. For example, Chaum’s blinding technique [2] can be used
to protect an RSA implementation since it prevents an attacker from knowing
the data used in the exponentiation. This method cannot be applied directly to
computations based on the discrete logarithm problem as there is no public
exponent associated with the secret exponent, as in RSA.

    for i from 0 to n
        e_i = 0
    for i from 0 to n
        e_{n-i} = 1
        choose M randomly
        double-and-add(P,d) on equipment 1
        double-and-add(P,e) on equipment 2
        if no correlation at step (i + 1) then e_{n-i} = 0
    return e

Fig. 4. MESD attack to find secret scalar d



4 Usual DPA Countermeasures for Double-and-Add Algorithm

The best-known countermeasures for scalar multiplication on elliptic curves
have been published by Coron [3]. In that paper, the author describes three
different countermeasures, which are respectively based on the blinding of the
scalar, on the blinding of the point and on the blinding of the multiplication. We
now recall the description of the first two countermeasures. Then we explain, in
section 5, how to defeat them.



4.1 Coron’s First Countermeasure 

During the computation of a scalar multiplication, the scalar can be blinded by
adding a multiple of the number $\ell$ of points of the curve. For this purpose, the
algorithm needs a random value $r$ whose length is fixed to 20 bits in [3]. Then,
the algorithm computes $(d + r\ell)P$, which is obviously equal to $dP$. This
countermeasure, depicted in figure 5, is very efficient since the scalar value changes
for each computation.

    pick random value r
    d' = d + r.ℓ
    return double-and-add(P,d')

Fig. 5. Implementation 1 secure against DPA
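The scalar blinding of figure 5 can be exercised on a toy group. In the sketch below the additive group of integers modulo a prime `order` stands in for the curve (so every element is killed by `order`, as a curve point is by the number of points); the function names and parameters are assumptions made for the illustration.

```python
import random

def double_and_add(P, d, order):
    """Left-to-right double-and-add in the additive group Z_order (toy stand-in)."""
    S = 0
    for i in range(d.bit_length() - 1, -1, -1):
        S = (2 * S) % order
        if (d >> i) & 1:
            S = (S + P) % order
    return S

def blinded_scalar_mult(P, d, order, rng):
    """Figure 5: multiply by d' = d + r * order with a fresh 20-bit random r."""
    r = rng.getrandbits(20)
    d_blind = d + r * order
    return double_and_add(P, d_blind, order)

# Toy parameters, assumed for illustration only.
order, P, d = 1009, 345, 777
rng = random.Random(3)
results = [blinded_scalar_mult(P, d, order, rng) for _ in range(5)]
print(results)
```

All five calls use a different bit pattern for the scalar, yet return the same group element.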
4.2 Coron’s Second Countermeasure

The second solution is based on the same idea as Chaum’s blind RSA signature
scheme. However, since we use a discrete-log-based problem, applying this
method requires twice the time needed for a single scalar multiplication. To be
more efficient, it is proposed in [3] to store a secret point $R$ and the associated
value $S' = dR$. The multiplication of $P$ by $d$ is performed by computing $d(P + R)$
and then subtracting $S'$ from the result. The variability is obtained by doubling $R$
and $S'$ at each execution, as shown in figure 6.

    pick b ∈ {0, 1} at random
    R = (-1)^b.2.R
    S' = (-1)^b.2.S'
    S = double-and-add(P + R, d)
    return S - S'

Fig. 6. Implementation 2 secure against DPA
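The point blinding of figure 6 keeps a small state $(R, S')$ that is refreshed at every call. The sketch below models it with a hypothetical `PointBlinding` class over the additive group of integers modulo a prime, used as a toy stand-in for the curve; names and parameters are assumptions made for the illustration.

```python
import random

def scalar_mult(P, d, order):
    """Left-to-right double-and-add in the additive group Z_order (toy stand-in)."""
    S = 0
    for i in range(d.bit_length() - 1, -1, -1):
        S = (2 * S) % order
        if (d >> i) & 1:
            S = (S + P) % order
    return S

class PointBlinding:
    """State (R, S' = dR) refreshed as in figure 6 before each multiplication."""
    def __init__(self, d, order, rng):
        self.d, self.order, self.rng = d, order, rng
        self.R = rng.randrange(1, order)
        self.S_prime = scalar_mult(self.R, d, order)

    def multiply(self, P):
        sign = -1 if self.rng.getrandbits(1) else 1      # (-1)^b
        self.R = (sign * 2 * self.R) % self.order
        self.S_prime = (sign * 2 * self.S_prime) % self.order
        S = scalar_mult((P + self.R) % self.order, self.d, self.order)
        return (S - self.S_prime) % self.order

# Toy parameters, assumed for illustration only.
order, d = 1009, 777
blinder = PointBlinding(d, order, random.Random(5))
points = [12, 345, 678, 900]
results = [blinder.multiply(P) for P in points]
print(results)
```

The invariant $S' = dR$ is preserved by the refresh step, which is why the subtraction always removes the mask.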



In this paper we will not focus on the third countermeasure proposed by
Coron since it has been partially broken by Goubin in [5]. Our attack does not
work on this countermeasure and does not improve on Goubin’s attack.

These two countermeasures are widely accepted as effective against power
attacks. Other recently proposed countermeasures, such as randomized NAF [6]
or [7,17,8], mainly focus on improving efficiency in terms of speed.

5 The New Attack 

We introduce a new attack based mainly on two reasonable assumptions. This 
attack is able to recover the secret scalar with a few requests to the card. The 
adversary needs to send chosen messages directly to the double-and-add 
algorithm. Indeed, when considering decryption in the ElGamal cryptosystem or 
in the RSA cryptosystem for instance, the padding can only be verified at the 
end of the computation. 

The idea of the attack is based on the fact that, even if an adversary is not 
able to tell which computation is done by the card, he can at least detect when 
the card performs the same operation twice. More precisely, if the card 
computes 2A and 2B, the attacker is not able to guess the value of A or B, but 
he is able to check whether A = B. Such an assumption is reasonable since this 
kind of computation usually takes many clock cycles and depends greatly on the 
value of the operand. This assumption has been used in a stronger variant and 
validated by Schramm et al. in [16]. Indeed, they are able to distinguish 
collisions during one DES round computation, which is much more difficult than 
distinguishing collisions during a doubling operation. If the noise is 
negligible, a simple comparison of the two power consumption curves during the 
doubling will be sufficient to detect this equality. 

If the noise is larger, we propose two solutions to detect equalities. 
The first and easiest one is to compare the averages of several consumption 
curves obtained with A and with B. However, requesting the same computation 
twice may be impossible. In that case, a better solution is to use the tiny 
differences over many clock cycles, since a point doubling usually takes a few 
thousand cycles. By summing the squares of the differences between these curves 
over all clock cycles, we can decrease the influence of noise. This approach is 
detailed in appendix A. 
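The squared-difference comparison described above can be sketched on synthetic data. Everything in this example is an assumption made for illustration: the traces are random numbers rather than measurements, the noise is Gaussian with standard deviation `sigma`, and the names `indicator` and `noisy_trace` are hypothetical.

```python
import random


def indicator(trace_a, trace_b):
    """Mean of the squared cycle-by-cycle differences between two traces."""
    c = len(trace_a)
    return sum((x - y) ** 2 for x, y in zip(trace_a, trace_b)) / c


def noisy_trace(signal, sigma, rng):
    """Add independent Gaussian noise to a noise-free consumption curve."""
    return [s + rng.gauss(0.0, sigma) for s in signal]


rng = random.Random(2003)
c = 2000      # assumed number of data-dependent cycles in one doubling
sigma = 1.0   # assumed noise standard deviation

# Synthetic noise-free consumption for two different operands A and B.
signal_a = [rng.uniform(0.0, 4.0) for _ in range(c)]
signal_b = [rng.uniform(0.0, 4.0) for _ in range(c)]

same = indicator(noisy_trace(signal_a, sigma, rng), noisy_trace(signal_a, sigma, rng))
diff = indicator(noisy_trace(signal_a, sigma, rng), noisy_trace(signal_b, sigma, rng))
print(same, diff)
```

With these settings `same` should stay close to 2·sigma², the pure noise contribution, while `diff` is noticeably larger because the data-dependent term no longer cancels. A threshold between the two values then separates collisions from non-collisions.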



5.1 Description of the Doubling Attack 

The so-called “doubling attack” is based on the fact that similar intermediate 
values may be manipulated when working with points P and 2P. However, this 
idea only works when using the downward routine. 

Let us first consider an example. Let d = 78 = 64 + 8 + 4 + 2, i.e. n = 6 and 

$(d_0, d_1, d_2, d_3, d_4, d_5, d_6) = (0, 1, 1, 1, 0, 0, 1)$ 

Then we compare the sequence of operations when the downward binary scalar 
multiplication algorithm of figure 3 is used to compute d × P (on the left) and 
d × (2P) (on the right): 



 i   d_i   comput. of dP            comput. of d(2P) 
 6    1    2 × 0      0 + P         2 × 0      0 + 2P 
 5    0    2 × P      2P + P        2 × 2P     4P + 2P 
 4    0    2 × 2P     4P + P        2 × 4P     8P + 2P 
 3    1    2 × 4P     8P + P        2 × 8P     16P + 2P 
 2    1    2 × 9P     18P + P       2 × 18P    36P + 2P 
 1    1    2 × 19P    38P + P       2 × 38P    76P + 2P 
 0    0    2 × 39P    78P + P       2 × 78P    156P + 2P 
           return 78P               return 156P 



If we focus on the doubling operations, we notice that some of them manipulate 
the same operand. More precisely, we observe that the doubling operation at rank 
$i$ in the computation of dP is the same as the doubling operation at rank 
$i - 1$ in the computation of d(2P) if and only if $d_{i-1} = 0$ (ranks being 
counted from the start of the computation, and $d_{i-1}$ denoting the bit 
processed at rank $i - 1$). Consequently, all the bits (except the least 
significant one) can be deduced from the SPA analysis of only two power 
consumption curves, the first one obtained during the computation of dP and 
the second one with d(2P). 

More formally, let us denote the partial sums 
$S_k(P) = \sum_{i=0}^{k} d_{n-i}\, 2^{k-i} \times P$. 

This value is the content of S[0] in the algorithm of figure 3 after k + 1 
iterations. Therefore we also refer to $S_k(P)$ as an intermediate result of the 
binary scalar multiplication algorithm. So this value is used in the next 
doubling operation in any case. Besides, 

$$\begin{aligned}
S_k(P) &= \sum_{i=0}^{k} d_{n-i}\, 2^{k-i} \times P \\
       &= \sum_{i=0}^{k-1} d_{n-i}\, 2^{k-1-i} \times (2P) + d_{n-k} \times P \\
       &= S_{k-1}(2P) + d_{n-k} \times P
\end{aligned}$$

Thus the intermediate result of the downward double-and-add algorithm with 
P at step $k$ will be equal to the intermediate result with 2P at step $k - 1$ 
if and only if $d_{n-k}$ is null. 

Using the same example as before, we obtain: 

value of step k       0     1     2     3     4     5     6 
value of d_{n-k}      1     0     0     1     1     1     0 
value of S_k(P)       P     2P    4P    9P    19P   39P   78P 
value of S_k(2P)      2P    4P    8P    18P   38P   78P   156P 
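The values in this table can be recomputed mechanically. The sketch below represents each point by its integer multiple of P, which is an assumption that holds as long as the multiples involved are distinct modulo the group order. The function name `partial_sums` is hypothetical.

```python
def partial_sums(d, n, base=1):
    """Return the bits d_n..d_0 and the list [S_0, ..., S_n].

    Each S_k is expressed as an integer multiple of P when the input
    point is base * P (base=1 for P, base=2 for 2P).
    """
    bits = [(d >> (n - k)) & 1 for k in range(n + 1)]   # d_n, d_{n-1}, ..., d_0
    sums, acc = [], 0
    for bit in bits:
        acc = 2 * acc + bit * base      # S_k = 2 * S_{k-1} + d_{n-k} * (input point)
        sums.append(acc)
    return bits, sums


bits, s_p = partial_sums(78, 6, base=1)
_, s_2p = partial_sums(78, 6, base=2)

for k in range(1, 7):
    collision = s_p[k] == s_2p[k - 1]
    print(k, bits[k], s_p[k], s_2p[k - 1], collision)
```

For every k ≥ 1 the collision flag is true exactly when the bit d_{n-k} is 0, which is the relation S_k(P) = S_{k-1}(2P) + d_{n-k} × P read off for this example.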



In conclusion, we just need to compare the doubling computation at step 
$k + 1$ for P and at step $k$ for 2P to recover the bit $d_{n-k}$. If both 
computations are identical, $d_{n-k}$ is equal to 0; otherwise $d_{n-k}$ is 
equal to 1. This can also be observed by shifting the second measurement curve 
by one step to the right and comparing it to the first curve. Therefore, with 
only two requests to the card, it is possible to recover all the bits of the 
secret scalar except the least significant one. 
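The bit-recovery rule above can be simulated end to end. This is a minimal sketch under two assumptions: points are represented by their integer multiple of P, and the attacker's ability to detect identical doublings is modelled by directly comparing operands, whereas a real adversary would compare power traces. The names `doubling_operands` and `recover_bits` are hypothetical.

```python
def doubling_operands(d, n, base):
    """Operands (as multiples of P) doubled at steps 0..n of the downward
    double-and-add when the input point is base * P."""
    acc, operands = 0, []
    for i in range(n, -1, -1):
        operands.append(acc)        # value fed to the doubling at this step
        acc = 2 * acc
        if (d >> i) & 1:
            acc += base             # addition of the input point
    return operands


def recover_bits(ops_p, ops_2p):
    """d_{n-k} is 0 iff doubling k+1 for P collides with doubling k for 2P."""
    n = len(ops_p) - 1
    return [0 if ops_p[k + 1] == ops_2p[k] else 1 for k in range(n)]


ops_p = doubling_operands(78, 6, base=1)
ops_2p = doubling_operands(78, 6, base=2)
print(recover_bits(ops_p, ops_2p))   # bits d_6 down to d_1
```

The returned list holds d_n down to d_1. The least significant bit does not influence any doubling operand, so it has to be guessed or checked against a known result.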

Note that this attack also works with addition-subtraction chains such as the 
Non-Adjacent Form (NAF) representation [12]. It allows all the zeros in the NAF 
coding to be recovered, which represent roughly two thirds of the digits 
according to the paper of Morain and Olivos [12]. The missing information can 
be recovered by exhaustive search or by a more efficient method such as an 
adaptation of the baby-step giant-step algorithm for short Diffie-Hellman 
exponents. Indeed, if the group order is a 160-bit prime number, then only 54 
bits remain to be discovered. Moreover, the baby-step giant-step algorithm can 
be used to reduce the complexity of the discovery of the discrete logarithm to 
$2^{27}$ in time and in memory. 
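To make the NAF remark concrete, the sketch below computes a non-adjacent form and the rough count of digits left to guess. The function name `naf` is hypothetical, and the digit-by-digit recoding shown is the standard textbook one, not necessarily the variant used in [12].

```python
import math


def naf(k):
    """Non-adjacent form of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            digit = 2 - (k % 4)     # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= digit
        else:
            digit = 0
        digits.append(digit)
        k >>= 1
    return digits


digits = naf(78)
print(digits)                        # 78 = 64 + 16 - 2
print(digits.count(0), len(digits))  # zeros revealed by the attack vs. total digits

# Rough estimate from the text: about one digit in three is nonzero.
unknown_digits = math.ceil(160 / 3)    # digits left for a 160-bit scalar
bsgs_cost_log2 = unknown_digits // 2   # square-root search cost, as a power of two
print(unknown_digits, bsgs_cost_log2)
```

The attack reveals the positions of the zero digits; only the signs of the nonzero digits remain, which is the residual search the baby-step giant-step method is meant to cover.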



5.2 Application of the Doubling Attack to Coron’s Countermeasures 

The first countermeasure of Coron uses a 20-bit random value to blind the secret 
scalar at each request to the card. This size of random value is sufficient to 
resist usual DPA attacks. However, it is not enough to resist our new doubling 
attack. Indeed, due to the birthday paradox, after $2^{10}$ requests with P and 
$2^{10}$ requests with 2P, there should exist a common scalar with high 
probability. In order to find this collision on the scalar, the attacker needs 
to compare each curve obtained with P to each curve obtained with 2P. With the 
method described before, assuming the same scalar is used, it is possible to 
find the positions of the zeroes in this scalar. The right pair will be 
distinguished as it will give a scalar with enough zeroes. Indeed, if the 
corresponding scalars are different, it is unlikely that common intermediate 
values appear in both computations. Hence identifying the right pair is quite 
easy and requires only $2^{20}$ comparisons. In case several bits of the scalar 
cannot be clearly identified due to the noise, they can be recovered by 
exhaustive search. The attack is summarized in figure 7. 



while no correct pair is found 
    request computation with P and store measurement C(P) in set A 
    request computation with 2P and store measurement C(2P) in set B 
    compare C(P) with all of set B 
    compare C(2P) with all of set A 
    if two measurements have many common intermediate squarings, 
        a correct pair is found 

exhaustive search for undefined bits of the scalar recovered with 
the correct pair 



Fig. 7. Attack of Coron’s first countermeasure 
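The number of requests needed before a pair shares the same random blinding value can be estimated with a short calculation. The sketch assumes the blinding values are independent and uniform over 20 bits, and it treats the pairs as independent, which is only an approximation. The function names are hypothetical.

```python
import math


def expected_pairs(n_p, n_2p, r_bits=20):
    """Expected number of (P, 2P) request pairs blinded with the same r."""
    return n_p * n_2p / 2 ** r_bits


def prob_at_least_one(n_p, n_2p, r_bits=20):
    """Approximate probability that at least one such pair exists,
    treating the n_p * n_2p pairs as independent."""
    return 1.0 - (1.0 - 2.0 ** -r_bits) ** (n_p * n_2p)


n = 2 ** 10
print(expected_pairs(n, n))        # pairs expected among n * n comparisons
print(prob_at_least_one(n, n))     # close to 1 - 1/e under this approximation
print(n * n == 2 ** 20)            # total number of trace comparisons
```

Under this model, the expected number of correct pairs grows with the product of the two request counts, so doubling both counts multiplies it by four.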



If the number of undefined bits is too large, one can notice that the number of 
correct pairs increases as the square of the number of extra requests. With about 
2^® requests in each group, the number of correct pairs will be approximately 
2^° which may help to decrease the work of the exhaustive search. 

The second countermeasure is even more vulnerable to the doubling attack. 
Indeed, the random value which blinds P is itself doubled at each execution. 
The attack goes as follows: a point P is sent as the first request, and the 
card executes its routine with the value P + R. The adversary then requests 
the computation with the point 2P. With probability 1/2, the card will use the 
point 2P + 2R = 2(P + R). The attacker is then able to compare the two 
measurements and to recover the secret scalar. If the noise is too large, the 
adversary can use a statistical approach. He can choose a random point Q, send 
Q and 2Q to the card, and take the difference between the first curve and the 
second one shifted by one step to the right. By summing the squares of those 
differences, we can recover all the bits of the secret scalar. Indeed, when the 
bit at step i is equal to 0, half of the differences at step i + 1 are null. So 
the curve representing the sum of the differences will be flatter at positions 
corresponding to a zero than at positions corresponding to a one. 

6 Conclusion 

A new powerful attack on scalar multiplication and modular exponentiation 
has been presented, which takes advantage of some implementation choices that 
were not considered a security concern up to now. With regard to this attack, 
it appears that the “bad” choice was the most commonly used one, due to 
efficiency criteria. This vulnerability considerably weakens the usual 
countermeasures used to defeat power attacks. 

Since no attack as efficient as the doubling attack is known on the upward 
double-and-add algorithm (from the least to the most significant bit), we 
recommend using this routine to compute scalar multiplications, combined with 
the appropriate countermeasures. 

It is an open problem whether our attack and Goubin’s attack can be combined 
in order to defeat the combination of Coron’s countermeasures. 

A Statistical Approach of Noise Reduction 

Let c denote the number of cycles of a point doubling operation. More precisely, 
we only consider the cycles that are data dependent, i.e., for which the power 
consumption differs when the operands are changed. The power consumptions 
(without noise) of the computations with A and B at cycle i are respectively 
named $C_A(i)$ and $C_B(i)$. The noise N can be modeled by independent random 
variables $N_A(i)$, $N_B(i)$ with mean $\mu$ and variance $\sigma^2$. The power 
consumptions observed at cycle i are then equal to $C_A(i) + N_A(i)$ and 
$C_B(i) + N_B(i)$. The indicator I is defined as follows: 

$$\begin{aligned}
I &= \frac{1}{c} \sum_{i=1}^{c} \bigl( C_A(i) + N_A(i) - C_B(i) - N_B(i) \bigr)^2 \\
  &= \frac{1}{c} \sum_{i=1}^{c} \bigl( C_A(i) - C_B(i) \bigr)^2
   + \frac{1}{c} \sum_{i=1}^{c} \bigl( N_A(i) - N_B(i) \bigr)^2 \\
  &\quad + \frac{2}{c} \sum_{i=1}^{c} \bigl( N_A(i) - N_B(i) \bigr)\bigl( C_A(i) - C_B(i) \bigr)
\end{aligned}$$

If c is large enough, we can evaluate the mean of the indicator when A = B 
and when A ≠ B. In the first case, 
$I = \frac{1}{c} \sum_{i=1}^{c} (N_A(i) - N_B(i))^2$, which is the average of 
the variables $Y(i) = (N_A(i) - N_B(i))^2$, assumed independent. The mean of 
$Y(i)$ is 

$$\begin{aligned}
E(Y) &= E\bigl( (N_A(i) - N_B(i))^2 \bigr) \\
     &= \mathrm{Var}\bigl( N_A(i) - N_B(i) \bigr) + \bigl[ E\bigl( N_A(i) - N_B(i) \bigr) \bigr]^2 \\
     &= \mathrm{Var}(N_A(i)) + \mathrm{Var}(N_B(i)) + 0 = 2\sigma^2
\end{aligned}$$

So the mean of the indicator, if A = B, is $E(I(A = B)) = 2\sigma^2$. 
In the second case (A ≠ B), assuming that 

$$\forall i \le c, \quad \varepsilon_1 \le |C_A(i) - C_B(i)| \le \varepsilon_2$$

the mean of the indicator can be bounded as follows, the cross term having 
zero mean: 

$$2\sigma^2 + \varepsilon_1^2 \le E(I(A \ne B)) \le 2\sigma^2 + \varepsilon_2^2$$

With this bound in mind, it appears that the indicator can be used to 
distinguish whether the manipulated data are equal or not. The confidence in 
this indicator relies on its variance and on the number of clock cycles c 
needed to perform the operation. 

Abstract. Power analysis attacks on elliptic curve based systems work 
by analysing the point multiplication algorithm. Recently Goubin observed 
that if an attacker can choose the point P to enter into the point 
multiplication algorithm then none of the standard three randomizations 
can fully defend against a DPA attack. In this paper we examine 
Goubin’s attack in more detail and completely discount its effectiveness 
when the attacker chooses a point of finite order; for the remaining cases 
we propose a defence based on using isogenies of small degree. 



1 Introduction 

Elliptic curves were first introduced into cryptography by Koblitz [9] and Miller 
[14] in 1985. Since that time, due to their perceived advantages in bandwidth 
and required computing resources, there has been increasing interest in using 
them in low-cost cryptographically enabled devices such as smart cards. 

Smart cards are a particularly interesting environment due to the ability 
for the attacker to mount side-channel attacks based on, for example, power 
analysis [10] and [11]. The idea behind these attacks is to measure the power 
consumption of the card and use this to derive information about the underlying 
secret key contained in the card. Such power attacks come in two variants: 
Simple Power Analysis, or SPA, uses only a single observation of the power to 
obtain information, whereas Differential Power Analysis, or DPA, makes many 
measurements and then uses a statistical technique to deduce information about 
the underlying secret. 

In the context of elliptic curve cryptography, power analysis is applied to 
determine the multiplier used in a point multiplication. In other words, for a 
public $P \in E(K)$ and a private $d \in \mathbb{Z}$, one uses power analysis 
to determine the value of d from the power consumed in computing 

$$Q = [d]P.$$

Since DPA requires multiple measurements, one can only apply DPA to protocols 
in which the same private multiplier d is applied over multiple protocol runs, 
with possibly different values of P in each protocol run. Hence DPA cannot be 
applied to ECDSA, two-pass ECDH or two-pass ECMQV. It can, however, be applied 
to ECIES, single-pass ECDH or single-pass ECMQV, where one of the “ephemeral” 
Diffie-Hellman/MQV keys is kept constant (i.e. it is a long-term static public 
key). On the other hand, SPA can be applied to any algorithm in which one needs 
to keep the multiplier secret. Hence, SPA applies to all elliptic curve 
protocols. 

A number of ways of defending against SPA have been proposed in the 
literature, for example the “double and add always” method of Coron [5], or the 
use of the Montgomery form [15], which helps prevent both SPA and timing 
attacks [16], [17]. These defences try to prevent information leaking because 
of the different power profiles of the addition and doubling formulae for the 
elliptic curve. 

An approach attracting increasing interest is to use group formulae which 
are identical for both addition and doubling. This idea was introduced in the 
context of the Jacobi form of an elliptic curve by Liardet and Smart [12]. This 
was extended to cover the Hessian form of a curve, see [7] and [19], 

$$x^3 + y^3 + 1 = Dxy.$$

Note that the Hessian form curves are particularly efficient in characteristic 
three [20], yet this advantage can only be exploited at the expense of having 
different routines for addition and doubling. Finally, Brier and Joye have 
given a single formula for both the addition and doubling law for elliptic 
curves in standard form [4]. To recap, the standard form for a curve in 
characteristic two is given by 

$$y^2 + xy = x^3 + ax^2 + b,$$

whilst in large prime characteristic it is given by 

$$y^2 = x^3 + ax + b.$$

For efficiency reasons it is common to select a = 1 in characteristic two and 
a = -3 in large prime characteristic, see [3] for the reasons, and many curves 
recommended, or mandated, in standards documents satisfy these extra 
conditions on a. 

Yet SPA defences are not enough to prevent DPA attacks. Coron [5] proposed 
three possible DPA defences, namely: randomizing the secret exponent d, adding 
random points to P to randomize the base point, and using a randomized 
projective representation. Only the third of these can be done with minimal 
cost, whilst the other two are not as effective and add additional 
computational costs into the point multiplication algorithm. In a similar vein 
to randomized projective coordinates, Joye and Tymen introduced two other cheap 
randomizations, namely random curve isomorphisms and random field isomorphisms 
[8]. 

So a combined approach of using indistinguishable group laws and a 
randomization of (at least one of) the projective point representation, the 
curve representation or the field representation would appear to offer a 
defence against power analysis for elliptic curve systems. However, recently 
Goubin [6] observed that if the attacker can choose the point P to enter into 
the point multiplication algorithm then none of these three randomizations can 
fully defend against a DPA attack. 

In this paper we examine Goubin’s attack in more detail and discount its 
effectiveness in a large number of cases; for the remaining cases we propose 
a defence based on using isogenies of small degree. The paper is organized as 
follows: In Section 2 we describe Goubin’s attack and his notion of “Special 
Points” and examine the three anti-DPA randomizations mentioned above. We 
divide the special points into two types, those of small order and those of 
large order. Then in Section 3 we explain how careful implementation of 
existing standards definitions means we need not worry about special points of 
small order. In Section 4 we recap various aspects of isogenies on elliptic 
curves over finite fields. We then use these isogenies to propose a defence for 
special points of large order in Section 5. In addition we examine whether our 
proposed defence works for the elliptic curves recommended or mandated in 
various standards. Finally we end in Section 7 with some conclusions. 



2 “Special Points” 

Before presenting Goubin’s refined power analysis attack we present the three 
standard randomized DPA defences mentioned in the introduction. 

Let $C(X, Y, Z)$ denote a projective representation of the affine elliptic curve 
we are using in our cryptosystem, whose affine form we shall assume is monic in 
Y. There is a map from affine coordinates to projective coordinates 

$$(x, y) \mapsto (x, y, 1)$$

and a similar reverse one 

$$(X, Y, Z) \mapsto (X/Z^s, Y/Z^t)$$

where s and t are the “weights” of the projective representation. Note: As 
above, we shall use lower case letters to denote variables on the affine form 
of the curve and capital letters for the projective form. 

The three proposed randomization defences against DPA are as follows: 



2.1 Randomized Projective Coordinates 

Here one takes the affine point P = (x, y) and, before applying d to it, first 
maps it into a projective representation, using a random $r \in K^*$, 

$$(x, y) \mapsto (x r^s, y r^t, r).$$

One then performs the point multiplication in projective coordinates. Since 
multiple runs of the protocol will result in different values of r, each run 
will be uncorrelated with the other runs and so a DPA attack seems impossible 
to mount. 

2.2 Randomized Curve Isomorphism 

Here we have $P \in C$ given to us and we then define $P' = (r^s x, r^t y)$ for 
some random $r \in K^*$. We then consider $P'$ as a point on $C'$ where, if C is 
given by 

$$C : \sum_{i,j} a_{i,j}\, x^i y^j = 0$$

then $C'$ is given by 

$$C' : \sum_{i,j} a'_{i,j}\, x^i y^j = 0$$

with $a'_{i,j} = v\, a_{i,j} / r^{si + tj}$ and $v$ chosen so as to make $C'$ 
monic in y. The curves C and $C'$ are isomorphic. In our cryptographic 
operation we now compute $Q' = (X', Y') = [d]P'$ on $C'$ and then map this back 
to C via $Q = (X, Y) = (X'/r^s, Y'/r^t)$. 

2.3 Randomized Field Isomorphism 

Here we take $P \in C$ and apply a random field isomorphism $\kappa : K \to K'$ 
to both P and C so as to obtain $P' = \kappa(P)$ and $C' = \kappa(C)$. We then 
compute $Q' = [d]P'$ and then compute $Q = \kappa^{-1}(Q')$. 



Goubin [6] defines a special point $P = (x, y) \in C$ to be one in which either 
x = 0 or y = 0. Goubin’s attack works by feeding suitable multiples $P'$ of a 
special point, depending on one’s guess for a given bit of d, into the smart 
card. Then, when the smart card computes $[d]P'$, the special point will occur 
within the computation, assuming the guess is correct. The existence of the 
special point will be picked up with a DPA trace since the property of being a 
special point is preserved under the three randomizations above. 
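The first randomization, and the reason a special point survives it, can be illustrated over a toy field. The prime `p = 17` and the weights `s = 2`, `t = 3` (as in Jacobian coordinates) are assumptions chosen for the example, and the function names are hypothetical.

```python
import secrets

p = 17        # toy prime field, an illustrative assumption
s, t = 2, 3   # assumed weights of the projective representation


def randomize(x, y):
    """Map (x, y) to (x * r^s, y * r^t, r) for a random nonzero r."""
    r = 1 + secrets.randbelow(p - 1)          # r in {1, ..., p - 1}
    return (x * pow(r, s, p) % p, y * pow(r, t, p) % p, r)


def to_affine(X, Y, Z):
    """Inverse map (X, Y, Z) -> (X / Z^s, Y / Z^t)."""
    return (X * pow(Z, -s, p) % p, Y * pow(Z, -t, p) % p)


X, Y, Z = randomize(0, 6)       # a point with x = 0, as in a special point
print(X, to_affine(X, Y, Z))    # X stays 0 whatever r was drawn
```

Multiplying a zero coordinate by any power of r still gives zero, so the randomization changes the representation of every other coordinate but leaves the zero that the attack looks for untouched.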

Elliptic curves in cryptography are usually chosen to have order 

$$\#E(K) = h \cdot q$$

where q is a large prime and h is a small integer called the cofactor. In 
practice one usually has h chosen from the set {1, 2, 3, 4, 6}. The values of h 
correspond to the orders of the small subgroups of E(K). We say that a special 
point has small order if it has order dividing h, otherwise we say it has large 
order. 

In Table 1 we examine the various cases for different curves. A “?” in the 
Order column denotes that the point could be one of large order, whilst a “?” 
in the characteristic column means the curve can be used in any characteristic. 

Table 1. Table of special points on Various Elliptic Curves 



Curve Equation                Char   Special Point   Order 
y^2 + xy = x^3 + ax^2 + b     2      (0, θ)          2 
y^2 + xy = x^3 + ax^2 + b     2      (θ, 0)          ? 
y^2 = x^3 + ax + b            > 3    (θ, 0)          2 
y^2 = x^3 + ax + b            > 3    (0, θ)          ? 
x^3 + y^3 + 1 = Dxy           ?      (0, θ)          3 
x^3 + y^3 + 1 = Dxy           ?      (θ, 0)          3 

3 “Special Points” of Small Order 

Special points of small order can be dealt with by careful implementation of 
the protocols used in elliptic curves. Note that, since Goubin’s attack is a 
DPA attack, we need only consider protocols in which the same secret multiplier 
is used on multiple runs of the protocol. Hence, we only need to consider 
protocols such as ECIES, single-pass ECDH and single-pass ECMQV. To deal with 
small subgroup attacks, various standards for these protocols make use of the 
cofactor h as a final postprocessing step before any point multiple is used in 
a protocol, see [1] or [2]. 

For example, in the one-pass Diffie-Hellman protocol, if Alice has the long 
term key a and Bob sends her the ephemeral public key P, then Alice will 
compute Q = [a]P followed by the (optional) postprocessing of [h]Q. If the 
cofactor is used then one calls the protocol cofactor-Diffie-Hellman. It is the 
step of computing Q = [a]P which is used by Goubin in his power analysis 
attack, by sending Alice a special point P of small order. 

If we insist on implementors using the cofactor variant of Diffie-Hellman then 
we can avoid Goubin’s attack by simply reversing the order of multiplication by 
a and h. In other words, Alice first computes $Q = [h]P$ and then computes the 
shared secret via $[a]Q$, if and only if $Q \neq \mathcal{O}$. Goubin’s attack 
then no longer applies since only genuine points in the subgroup of order q are 
passed into the point multiplication routine with the secret exponent a. 
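The reordering described above can be sketched in a stand-in group. As an assumption made for brevity, the integers modulo h·q under addition replace the curve group, with 0 playing the role of the point at infinity. The constants and the function names `mul` and `cofactor_dh` are hypothetical.

```python
H, Q_PRIME = 4, 1009     # assumed toy cofactor and prime subgroup order
N = H * Q_PRIME          # stand-in group: integers modulo N under addition


def mul(k, point):
    """Stands for the point multiplication [k]P in the stand-in group."""
    return (k * point) % N


def cofactor_dh(a, P):
    """Multiply by the cofactor first, and only then by the secret a."""
    Q = mul(H, P)
    if Q == 0:           # 0 plays the role of the point at infinity O
        raise ValueError("input point has small order; rejected")
    return mul(a, Q)


a = 123
print(cofactor_dh(a, 5) == mul(H, mul(a, 5)))   # same shared value, safer order

try:
    cofactor_dh(a, Q_PRIME)                     # element of order 4, i.e. small order
except ValueError as err:
    print(err)
```

The shared value is unchanged because the two multiplications commute, but the secret multiplier now only ever touches an element of the order-q subgroup.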

Similar arguments, involving insisting on cofactor variants of all protocols 
and reversing the order of multiplication by the cofactor and the secret key, 
can be applied to ECMQV and ECIES as defined in [1] and [2]. 

Recently Shoup has proposed a new variant of ECIES [18] for inclusion in 
a draft ISO standard. This new variant processes the cofactor in a completely 
different way to the old version of ECIES. A quick look at the new ECIES 
reveals that the new version processes the cofactor before the secret key 
multiplication, as we recommend; hence the new version is already protected 
against Goubin’s attack for special points of small order. 

4 Recap on Isogenies 

We recap, for use in the next section, some basics on isogenies between elliptic 
curves. All of what we require can be found in [3]. 

Let $E_1$ and $E_2$ denote elliptic curves over a finite field K of 
characteristic p. An isogeny 

$$\psi : E_1 \to E_2$$

is a non-constant rational map which respects the group structure of $E_1$ and 
$E_2$, i.e. 

$$\psi(P + Q) = \psi(P) + \psi(Q).$$

Every isogeny has a finite kernel and the size of this kernel is called the 
degree of the isogeny. If $E_1$ and $E_2$ are isogenous then we have that 
$\#E_1(K) = \#E_2(K)$. 

If $j_1$ and $j_2$ are the j-invariants of the two curves then an isogeny of 
prime degree $l$ exists (over the base field) if and only if $j_1$ and $j_2$ 
are a solution of the modular equation of degree $l$, i.e. 

$$\Phi_l(j_1, j_2) = 0.$$

The equation $\Phi_l(X, Y) = 0$ defines the modular curve $X_0(l)$ which 
parametrizes all elliptic curves for which there is a degree $l$ isogeny 
between them. 

These modular equations $\Phi_l(X, Y)$ grow large quite quickly as one 
increases $l$. This has led to the introduction of more suitable modular curves 
(and hence equations) for larger values of $l$ (say $l > 41$). But, since in 
our application we are only interested in small values of $l$, we will not 
consider these generalized modular equations and so will restrict ourselves to 
the standard modular curves. 

As a subprocedure of the Schoof-Elkies-Atkin algorithm [3][Chapter VII] one 
takes an elliptic curve $E_1$ and then determines whether there is an elliptic 
curve $E_2$ which is $l$-isogenous to $E_1$. This is done by solving the 
modular equation 

$$\Phi_l(X, j_1) = 0$$

over the field K. One can then determine $E_2$ and then, via a rather involved 
procedure, determine the mapping $\psi$. See [3][Chapter VII] for the precise 
algorithm for determining $E_2$ and $\psi$. 

The mapping $\psi$ for a degree $l$ isogeny is of the form 

$$\psi : E_1 \to E_2, \qquad
(x, y) \mapsto \left( \frac{f_1(x)}{g(x)^2}, \; \frac{y \cdot f_2(x)}{g(x)^3} \right)$$

where $f_1$, $f_2$ and $g$ are polynomials over K of degree $2d + 1$, $3d + 1$ 
and $d$ respectively, for $d = (l - 1)/2$. 



5 “Special Points” of Large Order 

In characteristic two the special points $(\theta, 0)$ of large order can easily 
be defended against by use of the Montgomery ladder [13], since in that case 
the y-coordinate is not used. Hence, we will restrict our discussion to large 
prime characteristic and to curves in Weierstrass form, since these are more 
important for applications. 

Special points of large order have been shown to exist by Goubin on a large 
number of the curves of large prime characteristic defined in standards. The 
existence of special points of large order is due to the equation 

$$y^2 = x^3 + ax + b$$

being such that b is a square in $\mathbb{F}_p^*$. We propose to manage the 
problem of special points of large order by transferring the cryptographic 
protocol over to an isomorphic group (but not an isomorphic curve) via an 
isogeny 

$$\psi : E_1 \to E_2.$$

Note, the curve $E_2$ and the isogeny we will use are all defined over the base 
field $\mathbb{F}_p$. For each curve in the standards which exhibits a special 
point, we then need to determine a fixed (low degree) isogeny to an elliptic 
curve which does not exhibit a special point. 

In Table 2 we list all recommended/mandated curves over fields of large prime 
characteristic in the main standards. For each curve we list the minimal degree 
of an isogeny to a curve which does not exhibit a special point. If the original 
curve does not exhibit a special point then we specify this degree as one. We 
also list the degree of the minimal isogeny to a curve one would prefer, i.e. a 
curve which has a particularly efficient model for computational purposes; by 
this we mean, in odd characteristic, a model of the form 

$$y^2 = x^3 - 3x + b.$$

If the curve has minimal isogeny degree one and if the original curve was not of 
the above special form then we do not give a figure for the preferred minimal 
degree. 

The data in Table 2 was computed via the Magma computer algebra system. 
Given a curve and the minimal isogeny degree it is relatively straightforward, 
see [3][Chapter VII], to compute the equation of the isogenous curve and the 
isogeny itself using Vélu’s formulae [21]. Indeed Magma will compute this 
isogeny for you if required. 

Hence, all that the smart card need do to protect against special points of 
large order is to store, along with the original curve from the standard, the 
equation of the isogenous curve and the equations of the isogeny and its 
inverse. Input points can then be mapped over to the isogenous curve for 
computation and mapped back again to the original curve for further processing. 
Clearly all the standard defences, such as randomized projective 
representation and randomized curve isomorphisms, can then be applied to the 
computation on the isogenous curve. 

Table 2. Minimal Isogeny Degree Needed to Remove a Special Point 



Curve Name    Minimal Isogeny Degree    Preferred Minimal Degree 
secp112r1     1                         1 
secp112r2     11                        11 
secp128r1     7                         7 
secp128r2     1                         - 
secp160k1     1                         - 
secp160r1     13                        13 
secp160r2     19                        41 
secp192k1     1                         - 
secp192r1     23                        73 
secp224k1     1                         - 
secp224r1     1                         - 
secp256k1     1                         - 
secp256r1     3                         11 
secp384r1     19                        19 
secp521r1     5                         5 


6 Relative Cost of the Isogeny Defence 

To apply the isogeny defence it would be better to alter the standards so that 
the curves are replaced with isogenous ones. However, since this is unlikely to 
be an option, the smart card needs to convert the input point to the isogenous 
curve. If we assume the isogeny is of degree $l$, this means we need to 
evaluate three polynomials of degree $2d + 1$, $3d + 1$ and $d$, where 
$d = (l - 1)/2$. Using Horner’s rule this implies a maximum number of field 
multiplications of 

$$(2d + 1) + (3d + 1) + d = 6d + 2 \approx 3l.$$

Of course one problem is to actually store the coefficients of the polynomials 
defining the isogenies, which could be a problem in a device with limited memory. 
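The multiplication count above can be checked by instrumenting Horner's rule. In this sketch the polynomial coefficients are random placeholders rather than those of a real isogeny, the prime `p = 10007` is an arbitrary assumption, and only the three polynomial evaluations are counted. The function names are hypothetical.

```python
import random


def horner(coeffs, x, p):
    """Evaluate sum(coeffs[i] * x**i) mod p; return (value, multiplications)."""
    acc, mults = coeffs[-1] % p, 0
    for c in reversed(coeffs[:-1]):
        acc = (acc * x + c) % p     # one field multiplication per coefficient
        mults += 1
    return acc, mults


def count_isogeny_mults(l, p=10007, seed=0):
    """Multiplications needed to evaluate three placeholder polynomials of
    degrees 2d + 1, 3d + 1 and d, with d = (l - 1) / 2."""
    d = (l - 1) // 2
    rng = random.Random(seed)
    total = 0
    for degree in (2 * d + 1, 3 * d + 1, d):
        coeffs = [rng.randrange(p) for _ in range(degree + 1)]
        _, mults = horner(coeffs, 5, p)
        total += mults
    return total


for l in (3, 5, 7, 11, 13, 19, 23):
    print(l, count_isogeny_mults(l), 6 * ((l - 1) // 2) + 2)
```

A polynomial of degree n costs n multiplications under Horner's rule, which is where the sum (2d + 1) + (3d + 1) + d comes from. The same loop also shows the storage cost: 6d + 5 coefficients must be kept for the three polynomials.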

In [5] Coron mentions other defences against DPA which could be used to 
thwart Goubin’s attack. We discuss each of these in turn: 



Randomization of the Private Exponent: Here one sets 
$d' = d + k \cdot \#E(\mathbb{F}_p)$ for some random 20-bit (say) value k. One 
then computes $Q = [d']P$. On average this will require an additional 

$$20 M_d + 10 M_a$$

field multiplications, where $M_d$ is the number of field multiplications 
required in an elliptic curve doubling operation and $M_a$ is the number of 
field multiplications required in an elliptic curve addition operation. 
Typically, with projective coordinates, we have $M_d = 8$ and $M_a = 16$. 
Hence, one requires on average 320 extra field multiplications, which becomes 
more efficient than the isogeny method as soon as $l > 106$. Whilst this method 
appears to be slower than the isogeny method, it should be noted that 
randomizing the private exponent is easier to implement than the isogeny 
method. 

Point Blinding: Here one first computes $S = [d]R$, for some random point R 
of large order. Then at each request for the calculation of $Q = [d]P$ one 
computes, for a random bit b, 

$$R = (-1)^b\, 2R, \qquad S = (-1)^b\, 2S$$

and 

$$Q = [d](P + R) - S.$$

Hence, at each iteration one needs to perform two extra point additions and two 
extra point doublings. This corresponds to a typical cost of 48 field 
multiplications, which is more efficient than the isogeny method when $l > 16$. 
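The cost figures quoted for the three defences can be tabulated side by side. The sketch uses the operation counts given in the text (8 multiplications per doubling, 16 per addition) and the exact isogeny count 6d + 2 instead of the approximation 3l, so crossover points may differ slightly from the rounded ones. The function name `isogeny_cost` is hypothetical.

```python
M_D, M_A = 8, 16   # field multiplications per doubling / addition, from the text


def isogeny_cost(l):
    """Field multiplications to evaluate a degree-l isogeny with Horner's rule."""
    d = (l - 1) // 2
    return 6 * d + 2


exponent_blinding = 20 * M_D + 10 * M_A   # 20 extra doublings, 10 extra additions
point_blinding = 2 * M_D + 2 * M_A        # 2 extra doublings, 2 extra additions

print("exponent blinding:", exponent_blinding)
print("point blinding:   ", point_blinding)
for l in (3, 5, 7, 11, 13, 19, 23, 41, 73):   # degrees appearing in Table 2
    cost = isogeny_cost(l)
    print(l, cost, "cheaper than point blinding" if cost < point_blinding else "")
```

For the smallest degrees in Table 2 the isogeny evaluation costs fewer multiplications than point blinding, and every listed degree stays below the cost of exponent blinding. The comparison ignores the memory needed to store the isogeny coefficients.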

As we have already mentioned, Coron also proposed the use of randomized 
projective coordinates. This does protect against other forms of DPA, but not 
Goubin’s attack. Hence, combining randomized projective coordinates and the 
isogeny method one could achieve an efficient defence against all known forms 
of DPA against an elliptic curve implementation. 

7 Conclusion 

We have shown why Goubin’s refined power analysis attack can be discounted 
for many elliptic curve systems either by use of the cofactor variant of many 
protocols or the use of isogenies. 

The author would like to thank Marc Joye for useful comments on an earlier 
draft of this paper. The observation that the Montgomery ladder removes the 
need to consider special points of large order in characteristic two is due to 
him. 

Abstract. We investigate side-channel attacks where the attacker 
only needs the Hamming weights of several secret exponents to guess a 
long-term secret. Such weights can often be recovered by SPA, EMA, or 
simply by a timing attack. We apply this principle to propose a timing attack 
on the GPS identification scheme. We consider implementations of GPS 
where the running time of the exponentiation (commitment phase) 
leaks the exponent’s Hamming weight, which is typical of a square and 
multiply algorithm for example. We show that only 800 time measurements 
allow the attacker to find the private key in a few seconds on a PC with 
a success probability of 80%. Besides its efficiency, two other interesting 
points in our attack are its resistance to some classical countermeasures 
against timing attacks, and the fact that it works whether the Chinese 
Remainder Technique is used or not. 

Keywords: Side-Channel Attacks, Timing Attacks, GPS, Identification 
Schemes 



1 Introduction 

Timing attacks are certainly less powerful than power or electromagnetic
analysis. On the other hand, the very limited equipment they require and the
simplicity of the measurements make them much easier to deploy, even for an
unskilled adversary with very limited resources. Moreover, there are situations in
which power consumption or electromagnetic radiation cannot be measured while the
running time may be obtained, for example by measuring the delay between
question and answer [3].

GPS is an identification scheme initially proposed by Girault [7]. It was re- 
cently selected in Nessie’s portfolio of cryptographic primitives [4]. This protocol 
was designed for smart cards, allowing fast identification even on low-cost pro- 
cessors. The security of GPS was proven in [11]. The scheme is complete, sound 
(under the hypothesis that computing discrete logarithms with small exponents 
modulo n = pq is hard), and statistically zero-knowledge. Careless implementations
might nevertheless be subject to side-channel attacks.

Timing attacks against exponentiation schemes have been known for several
years, and various countermeasures have been proposed against them. However,
since some of these countermeasures are precisely based on randomizing the
exponent, one may feel tempted to believe that an algorithm where the exponent
is random by nature, and where a different exponent is used for each execution,
may not be subject to a timing attack. We show in this paper that this is not
true.

We propose a timing attack on GPS that recovers the prover’s private
key provided the exponentiation’s running time depends (in our example,
linearly) on the exponent’s Hamming weight. In our scenario, the attacker
impersonates the verifier, and is able to measure precisely the computation time for
the commitment step. Apart from this, the attacker has no knowledge of the
implementation, such as the multiplication algorithm or the time needed for an
individual multiplication.

To our knowledge, none of the previously known timing attacks needed only 
the leakage of the exponent’s Hamming weight ([5] and [12], for example, were 
based on modular reductions occurring in Montgomery multiplications). As a 
consequence, several of the countermeasures proposed against timing attacks 
turn out to be useless in our context. 

Lastly, previous timing attacks required large amounts of data to work. Our
attack only needs 600 timings to obtain the key with a probability of success of
72% after a few hours of computation on a single PC, while 1000 timings allow 89%
success after a few seconds of computation. Collecting such a number of timings is
easy (we recall that [8] states that the commitment followed by the answer takes
less than 100 milliseconds in total with a crypto-processor card).



Notations. For any integer $b$, we denote by $|b|$ its size in bits, and by $b_j$ the
$j$-th bit of $b$, starting from the least significant bit, i.e. $b = b_{|b|-1} \cdots b_0$.
For an interval $I$, $b \in_U I$ means that $b$ is chosen at random in $I$.

$l \oplus m$ denotes the addition $l + m$ modulo 2.

2 The GPS Identification Scheme 

This section briefly recalls the principle of GPS (see [2] for more details), then
focuses on the commitment step.

2.1 Short Description of GPS 

An authority generates two strong primes p and q and computes n = pq. It also 
chooses an integer g. GPS authors [8] recommend the value g = 2, as it fits the 
security requirements while allowing good performance. Note that this choice 
has no effect on our attack. 

Three parameters A, B and S are needed. We recall their minimum sizes as 
recommended in [8]. 

— $S$ is the size of the (long-term) private key, $|S| \ge 160$

— $B$ is the size of the challenge, $|B| = 16$, 32 or 64

— $A$ is the size of the ephemeral keys, $|A| = |B| + |S| + 64$ or $|A| = |B| + |S| + 80$

We set $E = A + (B - 1)(S - 1)$.

Two cases may occur: 

— Each user has his own modulus. In this case, $|n|$ should be at least 1024, and
the factors of n, p and q, can be revealed to the prover, allowing her to use 
the Chinese Remainder Technique to speed-up the commitment phase. 

— Several users share the same modulus n. In that case, |n| should be at least 
2048. 

The prover’s private key is a random element $x \in [0, S[$. Her public key is
$X = g^{-x} \bmod n$. An identification with GPS proceeds as follows:

1. Commitment: The prover picks a random $y \in [0, A[$. She computes
$Y = g^y \bmod n$ and sends $Y$ to the verifier.

2. Challenge: The verifier sends a random integer $c \in [0, B[$ to the prover.

3. Response: The prover checks that $c \in [0, B[$ and computes $z = y + cx$. She
sends $z$ to the verifier.

4. Verification: The verifier checks that $z \in [0, E[$ and $g^z X^c = Y \pmod n$.



Prover                                          Verifier

Pick y ∈_U [0, A[
Y = g^y mod n
                            Y →
                                                Pick c ∈_U [0, B[
                            ← c
Check c ∈ [0, B[
z = y + cx
                            z →
                                                Check z ∈ [0, E[
                                                g^z X^c = Y (mod n)



Fig. 1. A round of GPS 
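
The round of Fig. 1 can be sketched in a few lines of Python. This is only an illustration under toy assumptions: the modulus, the bounds S, B, A and the function names are ours, the primes are far too small (and not strong primes), and `pow` with a negative exponent requires Python 3.8 or later.

```python
import secrets

# Toy parameters (assumptions for illustration; real GPS uses |n| >= 1024).
p, q = 1009, 1013
n = p * q
g = 2
S, B = 2**8, 2**4            # bounds for the private key and the challenge
A = 2**28                    # bound for the ephemeral key
E = A + (B - 1) * (S - 1)    # bound checked by the verifier

x = secrets.randbelow(S)     # prover's private key
X = pow(g, -x, n)            # public key X = g^(-x) mod n

def commit():
    """Prover: pick y in [0, A[ and return (y, Y = g^y mod n)."""
    y = secrets.randbelow(A)
    return y, pow(g, y, n)

def respond(y, c):
    """Prover: check the challenge and answer z = y + c*x (no modular reduction)."""
    if not 0 <= c < B:
        raise ValueError("challenge out of range")
    return y + c * x

def verify(Y, c, z):
    """Verifier: check z in [0, E[ and g^z * X^c = Y (mod n)."""
    return 0 <= z < E and (pow(g, z, n) * pow(X, c, n)) % n == Y

y, Y = commit()
c = secrets.randbelow(B)
z = respond(y, c)
print(verify(Y, c, z))
```

Note that z is computed over the integers, which is why the attack described later can relate the bits of z directly to the bits of y and x.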



2.2 About the Commitment Step 

Two methods are possible to manage the commitment issue. The first one
consists in pre-computing commitments outside the card and then storing them in
the card, using several tricks to save memory space (see [2] for details). This
solution is interesting since it can be implemented on a card without a
crypto-processor, but it becomes a problem for applications that require a lot of
identifications to be performed (such as pay-TV, for example): the card has to be
reloaded with fresh commitments in a secure way periodically, or can only
perform a limited number of identifications (an optimized version with 6000
pre-computed commitments needs about 36 kb of memory space).

The second method consists in making the card compute the commitments 
itself. This solution allows it to perform enough identifications for any applica- 
tion. Two variants of this method are possible: the computation can be done 
online, i.e. during the identification, or offline. The online variant requires a 
crypto-processor. In that case, the commitment can be computed in less than
100 milliseconds [8]. In the offline variant, the card computes the commitment
before the execution of the identification itself. Speed is not an issue anymore,
thus no crypto-processor is needed; however, the card needs to be supplied with 
current - and therefore, for classical smart cards, to be connected to a reader - 
during this computation. 



3 Context and Attack Overview 

In this section, we introduce the context of our attack, then draw its main 
principles. 



3.1 Scenario 

We assume that the attacker is able to accurately measure the computation 
time of the commitment. We furthermore assume that this computation time 
is roughly linear in the Hamming weight of the exponent. This section briefly 
discusses the realism of these assumptions. 

First of all, this will of course only be possible if the commitments are com- 
puted by the card, either online or offline. In the offline case, running times may 
be difficult to obtain by direct measurements, but indirect methods (e.g. based 
on power consumption) are still possible. 

We also assume that the attacker is able to impersonate an honest verifier,
which is a weak assumption: as resistance to dishonest verifier attacks is a natural 
requirement for an identification scheme, there is usually no need to bother about 
the verifier’s identity. In its core version, GPS does not perform any verification 
step on the verifier before entering the commitment phase. 

As far as accuracy of the measurement is concerned, we believe our assump- 
tion to be realistic, for example in the context of a smart card (a typical target 
for side-channel attacks) since smart cards are usually not equipped with an 
internal clock, but have their clock signal provided by the reader they are put 
in; in this context, accurate running time is easy to obtain with a rogue reader. 
To resist side-channel attacks, some modern smart cards are equipped with an
internal clock. However, our attack seems to be robust against small imprecision 
in running time, and could therefore be carried out simply by measuring the 
elapsed time between startup and commitment. 

Finally, the running time will clearly be a linear function of the Hamming
weight of the exponent in the case of a classical square and multiply algorithm.
Other methods, such as sliding windows or Walter’s division chains [15], are
discussed in section 6.
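
To make the linearity assumption concrete, here is a minimal Python sketch of a left-to-right square and multiply exponentiation that also counts its operations. The function name and the counters are ours; the point is only that the number of multiplications equals the Hamming weight of the exponent, while the number of squarings depends on its bit length.

```python
def square_and_multiply(g, e, n):
    """Left-to-right binary exponentiation that also counts its operations."""
    result, squarings, multiplications = 1, 0, 0
    for bit in bin(e)[2:]:
        result = (result * result) % n
        squarings += 1
        if bit == "1":
            # This extra multiplication happens once per 1-bit of the exponent.
            result = (result * g) % n
            multiplications += 1
    return result, squarings, multiplications

# Two exponents of the same length but different Hamming weights.
for e in (0b100001, 0b111111):
    value, sq, mul = square_and_multiply(2, e, 1000003)
    print(bin(e), sq, mul)
```

If every multiplication takes about the same time, the total running time for exponents of a fixed length is an affine function of the multiplication count, which is the leakage the attack relies on.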

Remark: For the sake of simplicity, we will assume in this description that 
A is a power of 2 (in fact, it is likely that many implementations choose this 
value, because it makes generation of random elements in [0, A[ easier). Taking 
this value allows us to use a very simple formula linking the Hamming weight of 
the ephemeral key y and the probability for one of its bits to be 0. A
different A makes the computations more tedious, but has no effect on the attack’s
efficiency.

3.2 Test Platform 

To validate our attack, we performed timing measurements using a smart card 
development kit [10,1]. Although not strictly practical, this scenario should be
very close to reality. Since the emulator is designed to allow implementors to
optimize their code before “burning” actual smart cards, its predictions almost
perfectly match the smart card’s behaviour. We therefore believe a physical attack
on an actual smart card should not induce many more measurement errors than
the ones we encountered here.



3.3 Attack Overview 

The core principle of the attack goes as follows: suppose the attacker has obtained
$w(y)$, the Hamming weight of some $y$ generated by the prover. Impersonating
the verifier and sending the challenge $c = 1$, he has also obtained the prover’s
response, $z = y + x$.

He starts by attacking the least significant bit of the secret, $x_0$. If the
Hamming weight of $y$ is smaller than half of $y$’s bit length, he may assume that the
least significant bit of $y$ is more likely to be equal to 0 than to 1. More precisely,
he assumes that the probability $P_0 = P(y_0 = 0)$ is equal to $1 - w(y)/|A|$.

From this, and knowing the value of $z$, he can easily deduce a probability
for the least significant bit of $x$ to be zero.

Of course, this single guess has a non-negligible chance of being wrong. However,
if the attacker is able to repeat this experiment over several identifications and
compute the average of these probabilities, then this error risk shrinks.

Once he has obtained $x_0$, the attacker can deduce the actual value of the
least significant bit $y_0^{(i)}$ of every message of his sample, which will allow him
to attack the second bit $x_1$. Note that it is not necessary to collect new sets of
samples.

The next section takes a deeper look at the attack. 

4 How to Recover the Key: Full Details 

The attack proceeds in three phases: time measurements, computation of a
candidate for the key, and key recovery.

4.1 Phase I: Time Measurements

In phase I, the attacker impersonates the honest verifier $k$ times, always sending
the same challenge^1 $c = 1$, and collects a list of pairs $(t^{(i)}, z^{(i)})$, $i = 1..k$, where
$t^{(i)}$ is the computation time for the commitment of the $i$-th interaction, and
$z^{(i)}$ is the corresponding answer $(y^{(i)} + x)$.

Assuming the exponentiation’s running time is roughly a linear function of
the Hamming weight, i.e.

$t^{(i)} = \alpha \times w(y^{(i)}) + \beta + \epsilon^{(i)}$

with unknown parameters $\alpha$ and $\beta$ ($\epsilon^{(i)}$ represents the error), the attacker
estimates the Hamming weights of the set of ephemeral keys.

Figure 2 gives a clear idea of how parameters $\alpha$ and $\beta$ can be estimated.
The left graph shows the sorted running times of 80 exponentiations performed
on our test platform (with the following parameters: $g = 2$, $|n| = 1024$, $A = 2^{240}$
and without CRT) and the right graph shows the corresponding Hamming
weights. Linear regression techniques on the left graph immediately provide good
estimates for $\alpha$ and $\beta$.

These figures might deserve a bit more attention. As we can see, running
times are grouped by “steps”, and the attacker can safely assume that these
“steps” correspond to several exponents having the same Hamming weight,
whereas the average height between two “steps” is the time for a modular
multiplication (i.e. $\alpha$). The right graph shows that this estimate is pretty
accurate: the actual Hamming weight is rarely further than 2 from its estimate.
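
The following Python sketch illustrates one possible way to turn timings into Hamming weight estimates. It is not the regression on sorted timings described above: as a simpler stand-in, it matches the mean and standard deviation of the timings to those of the Hamming weight of a uniformly random exponent (mean |A|/2, standard deviation sqrt(|A|)/2). The constants used to simulate the card (3.0 time units per multiplication, an offset of 500, Gaussian noise) are made-up values, and all names are ours.

```python
import random
import statistics

def estimate_weights(timings, a_bits):
    """Estimate Hamming weights from timings t = alpha * w + beta + noise.

    Assumption: y is uniform on a_bits bits, so w(y) has mean a_bits / 2
    and standard deviation sqrt(a_bits) / 2.
    """
    alpha = statistics.pstdev(timings) / (a_bits ** 0.5 / 2)
    beta = statistics.fmean(timings) - alpha * a_bits / 2
    return [round((t - beta) / alpha) for t in timings]

# Simulated measurements (hypothetical constants, not taken from the paper).
random.seed(0)
A_BITS, ALPHA, BETA = 240, 3.0, 500.0
weights = [bin(random.getrandbits(A_BITS)).count("1") for _ in range(1000)]
timings = [ALPHA * w + BETA + random.gauss(0, 0.5) for w in weights]

estimates = estimate_weights(timings, A_BITS)
mean_abs_error = statistics.fmean(abs(e - w) for e, w in zip(estimates, weights))
print(mean_abs_error)
```

A constant delay added to every timing is absorbed into the estimated offset, so it does not change the weight estimates.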

Remarks: 

— We purposely took a small number of samples in order not to overload the 
figures. In practice, using a much larger set (say, 1000 samples) will of course 
reduce the error. 

— We emphasize that this estimation is possible whether the Chinese Remainder
Technique is used or not. Without CRT, the exponentiation time leaks
information on $w(y)$. But with CRT, the card first computes $g^y \bmod p$ then
$g^y \bmod q$ (since $y \ll p, q$, we have $y = y \bmod (p - 1) = y \bmod (q - 1)$), and
the timing leaks information on $2 \times w(y)$.

— Smart card implementations of GPS probably require a constant time
between the insertion of the card in the reader and the beginning of the
computation of the commitment, meaning that the attacker gets a list $(t^{(i)} + R)$,
where $R$ is constant and unknown to the attacker, instead of a list $(t^{(i)})$.
Thanks to the linear regression, this has no effect on the attack’s efficiency.



^1 Always sending c = 1 improves both the precision and the simplicity of the attack,
but this is not a necessary condition. For example, the principle of the attack also
works when the verifier always sends powers of two as challenges: the pair
$(w(y), z = y + cx)$ will leak information on bits $x_{|S|-1} \ldots x_v$ of $x$.

Fig. 2. Sorted timings for 80 samples and corresponding Hamming weights



4.2 Phase II: Computing a Candidate for the Key 

In phase II, the attacker uses the collected data $(t^{(i)}, z^{(i)})$ to recover a candidate
$\tilde{x}$ for the private key.

Remark: Our procedure obviously works independently from the way the
attacker obtained the Hamming weights. Although we put ourselves in the context
of a timing attack, other methods, such as power or electromagnetic analysis,
would work as well.

Estimation of $P(y_j^{(i)} = 0)$. At step $j$ of the procedure, we must estimate
$p_j^{(i)} = P(y_j^{(i)} = 0)$ for each $i = 1, \ldots, k$. A basic estimation would be
$P(y_j^{(i)} = 0) = 1 - (w^{(i)}/|A|)$, but we use two tricks to refine it:

— since $y^{(i)}$ is at least 64 bits longer than $x$, the first bits of $y^{(i)}$ are very
likely to be the same as the first bits of $z^{(i)}$. We have
$z^{(i)}_{|A|-1} \cdots z^{(i)}_{|S|+10} = y^{(i)}_{|A|-1} \cdots y^{(i)}_{|S|+10}$
with probability at least $1 - 2^{-10}$. For simplicity, we will assume in
the following that this relationship always holds.

— if we already know $x_{j-1} \ldots x_0$, then we can use $z^{(i)}$ to compute
$y^{(i)}_{j-1} \cdots y^{(i)}_0$.

This leads to the following estimation of $P(y_j^{(i)} = 0)$:

$P(y_j^{(i)} = 0) = 1 - \frac{w(y^{(i)}_{|S|+9} \cdots y^{(i)}_j)}{|S| + 10 - j}$

with

$w(y^{(i)}_{|S|+9} \cdots y^{(i)}_j) = w(y^{(i)}) - w(z^{(i)}_{|A|-1} \cdots z^{(i)}_{|S|+10}) - w(y^{(i)}_{j-1} \cdots y^{(i)}_0)$.

Our procedure will only store $k$ values $w^{(i)}$ corresponding to the weight of
the part of $y^{(i)}$ that has not been guessed yet.

Guessing $x_0$. For each $i = 1, \ldots, k$, we have $z^{(i)} = y^{(i)} + x$. Thus
$z_0^{(i)} = y_0^{(i)} \oplus x_0$. Therefore we have $P(x_0 = 0) = P(z_0^{(i)} = y_0^{(i)})$.
For each $i$, we estimate $P(y_0^{(i)} = 0)$, then

$P(z_0^{(i)} = y_0^{(i)}) = \begin{cases} P(y_0^{(i)} = 0) & \text{if } z_0^{(i)} = 0 \\ 1 - P(y_0^{(i)} = 0) & \text{if } z_0^{(i)} = 1 \end{cases}$

Using all the pairs $(t^{(i)}, z^{(i)})$, we can use the following estimation:

$P(x_0 = 0) = \frac{1}{k} \sum_{i=1}^{k} P(z_0^{(i)} = y_0^{(i)})$

If we get $P(x_0 = 0) > 1/2$, we set $\tilde{x}_0 = 0$; else we set $\tilde{x}_0 = 1$.



Guessing $x_j$ for $j > 0$. To try to guess $x_j$ for $j > 0$, the situation is a little
different, because we now have to deal with a carry:
$z_j^{(i)} = y_j^{(i)} \oplus x_j \oplus \mathit{carry}_j^{(i)}$.

We use our previous guesses $(\tilde{x}_{j-1}, \ldots, \tilde{x}_0)$ to guess the carry. As we already
explained, we also use $(\tilde{x}_{j-1}, \ldots, \tilde{x}_0)$ to refine our estimation of $P(y_j^{(i)} = 0)$.
Each pair $(t^{(i)}, z^{(i)})$ leads to an estimation of $P(x_j = 0)$:

$P(x_j = 0) \approx P(y_j^{(i)} = z_j^{(i)} \oplus \mathit{carry}_j^{(i)})$

We take the mean of these estimations for $i = 1, \ldots, k$:

$P(x_j = 0) = \frac{1}{k} \sum_{i=1}^{k} P(y_j^{(i)} = z_j^{(i)} \oplus \mathit{carry}_j^{(i)})$

Again, if we get $P(x_j = 0) > 1/2$, we set $\tilde{x}_j = 0$; else we set $\tilde{x}_j = 1$.

Our procedure FindCandidate stores the list $\mathit{carry}^{(i)}$, $i = 1, \ldots, k$, where
$\mathit{carry}^{(i)}$ equals $\mathit{carry}_{j-1}^{(i)}$ at the beginning of procedure PartialEstimate$(i, j)$
and $\mathit{carry}_j^{(i)}$ at the end of this procedure.

Remark: If the estimation is wrong at step $j$, meaning that $\tilde{x}_j \ne x_j$, the main
consequence is an error on the next carry^2. It is easy to see that this error will
essentially propagate like a carry error in a classical addition, meaning that only
a small block of the output bits will be erroneous. Our experiments confirmed
this fact.

^2 Actually, a wrong estimation on $x_j$ also induces an error on the updated estimation
$w^{(i)}$. This error is small and does not affect the success of our attack.

Algorithm 1 PartialEstimate$(i, j)$: computes the $i$-th estimation of $P(x_j = 0)$
and the carry $\mathit{carry}_j^{(i)}$

  $y_{j-1}^{(i)} := z_{j-1}^{(i)} \oplus \tilde{x}_{j-1} \oplus \mathit{carry}_{j-1}^{(i)}$
  if $\tilde{x}_{j-1} + y_{j-1}^{(i)} + \mathit{carry}_{j-1}^{(i)} \ge 2$ then
    $\mathit{carry}_j^{(i)} := 1$
  else
    $\mathit{carry}_j^{(i)} := 0$
  end if
  Clear $\mathit{carry}_{j-1}^{(i)}$
  Store $\mathit{carry}_j^{(i)}$
  $w^{(i)} := w^{(i)} - y_{j-1}^{(i)}$
  $p_j^{(i)} := 1 - (w^{(i)} / (|S| + 10 - j))$
  if $z_j^{(i)} \oplus \mathit{carry}_j^{(i)} = 0$ then
    Return $p_j^{(i)}$
  else
    Return $1 - p_j^{(i)}$
  end if

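Algorithm 2 FindCandidate: computes $\tilde{x}$

  for $i = 1$ to $k$ do
    $w^{(i)} := w(y^{(i)}) - w(z^{(i)}_{|A|-1} \cdots z^{(i)}_{|S|+10})$
  end for
  for $j = 0$ to $|S| - 1$ do
    $P := 0$
    for $i = 1$ to $k$ do
      $P := P +$ PartialEstimate$(i, j)$
    end for
    $P := P/k$
    if $P > 0.5$ then
      $\tilde{x}_j := 0$
    else
      $\tilde{x}_j := 1$
    end if
    Store $\tilde{x}_j$
  end for
  Return $\tilde{x}_{|S|-1} \cdots \tilde{x}_0$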
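
The two procedures can be sketched together in Python as follows. This is an illustrative reading of Phase II rather than the authors' code: the function names are ours, the carry and remaining-weight updates are done right after each bit is decided instead of at the start of the next call, the Hamming weights are taken as exact (no timing noise), and the demonstration uses toy sizes (an 8-bit key and 72-bit ephemeral keys).

```python
import random

def find_candidate(samples, s_bits, margin=10):
    """Recover a candidate for x from pairs (w(y), z = y + x).

    samples: list of (weight_of_y, z); s_bits: bit length of the secret.
    The bits of y above position s_bits + margin are assumed equal to those of z.
    """
    top = s_bits + margin
    # Weight of the part of y that has not been guessed yet.
    remaining = [w - bin(z >> top).count("1") for w, z in samples]
    carry = [0] * len(samples)
    x_hat = 0
    for j in range(s_bits):
        length = top - j                      # number of unknown bits of y
        total = 0.0
        for i, (_, z) in enumerate(samples):
            p_zero = min(1.0, max(0.0, 1.0 - remaining[i] / length))
            bit = ((z >> j) & 1) ^ carry[i]   # equals y_j XOR x_j
            total += p_zero if bit == 0 else 1.0 - p_zero
        xj = 0 if total / len(samples) > 0.5 else 1
        x_hat |= xj << j
        for i, (_, z) in enumerate(samples):
            yj = ((z >> j) & 1) ^ carry[i] ^ xj
            remaining[i] -= yj
            carry[i] = 1 if xj + yj + carry[i] >= 2 else 0
    return x_hat

def simulate(x, a_bits, k, seed=1):
    """Generate k pairs (w(y), y + x) for random ephemeral keys y."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        y = rng.getrandbits(a_bits)
        pairs.append((bin(y).count("1"), y + x))
    return pairs

x = 0b10110101                       # toy 8-bit secret
samples = simulate(x, a_bits=72, k=2000)
print(find_candidate(samples, s_bits=8) == x)
```

With realistic sizes and noisy weight estimates the candidate may differ from x in a few bits, which is what Phase III is for.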



4.3 Phase III: Using the Candidate to Find the Key 

At the end of Phase II, the attacker finds a candidate $\tilde{x}$. If $g^{-\tilde{x}} = X \pmod n$,
the attack is a success. Even if $\tilde{x} \ne x$, the attacker can make an exhaustive
search on values $x'$ such that $d(x', \tilde{x})$, the Hamming distance between $x'$ and
$\tilde{x}$, is small, testing whether $g^{-x'} = X \pmod n$. The maximum distance such
that the attack can succeed obviously depends on the capacity of the attacker;
practical values are given in the next section.

5 Practical Results

We chose the following values to perform our tests: $g = 2$, $|n| = 1024$, A =
then used the test platform described in section 3.2 to obtain running times. We
did not use the Chinese Remainder Technique.

For Phase III, we assumed that the attacker can perform 400 tests per second
(which roughly corresponds to computations on a 2 GHz PC with the GMP
library). This leads to the following classification for the candidates:

1. If $\tilde{x} = x$, $\tilde{x}$ is already the correct key

2. If $d(\tilde{x}, x) \le 2$, the attacker will need less than 16 seconds on average to find
the correct key; we say that $\tilde{x}$ is a “seconds” candidate

3. If $d(\tilde{x}, x) \le 4$, the attacker will need less than 9 hours on average to find the
correct key; we say that $\tilde{x}$ is an “hours” candidate

4. If $d(\tilde{x}, x) \le 5$, the attacker will need less than 12 days on average to find the
correct key; we say that $\tilde{x}$ is a “days” candidate
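
The Phase III search and the orders of magnitude above can be sketched in Python. The sketch assumes a 160-bit key and 400 tests per second as in the text; the helper names, the toy modulus and the 10-bit toy key are ours. The time estimate counts only the keys at exactly the given distance and assumes the right one is found halfway through, which gives figures of the same order as those quoted.

```python
from itertools import combinations
from math import comb

def search_near(candidate, bits, max_distance, is_key):
    """Try every value within Hamming distance max_distance of candidate."""
    for d in range(max_distance + 1):
        for positions in combinations(range(bits), d):
            guess = candidate
            for pos in positions:
                guess ^= 1 << pos          # flip one of the chosen bits
            if is_key(guess):
                return guess
    return None

def average_search_seconds(bits, distance, tests_per_second=400):
    """Expected time if the key lies at exactly this distance (half the shell)."""
    return comb(bits, distance) / 2 / tests_per_second

# Toy check with hypothetical small parameters.
n, g = 1009 * 1013, 2
x = 0b1011010111
X = pow(g, -x, n)                          # requires Python 3.8+
candidate = x ^ 0b0000100100               # candidate with two wrong bits
found = search_near(candidate, 10, 2, lambda v: pow(g, -v, n) == X)
print(found is not None)

print(average_search_seconds(160, 2))              # seconds
print(average_search_seconds(160, 4) / 3600)       # hours
print(average_search_seconds(160, 5) / 86400)      # days
```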

Our attack’s results are summarized in Table 1. 



Table 1. Simulation Results

k (number of samples)               200     400     600     800    1000
Immediate keys                       0%      2%     21%     52%     72%
“seconds” candidates                 0%      3%     54%     80%     89%
“hours” candidates                   0%      6%     72%     94%     97%
“days” candidates                    0%     10%     77%     96%     98%
avg. distance $d(\tilde{x}, x)$   45.97   16.11    3.88    1.43     0.7



Phase II itself completes in a few seconds. When the number of samples is 
very small, it may be worth noticing that even if the attack does not produce 
directly exploitable results, it does anyway reveal substantial information about 
the key. 

6 Countermeasures 

Several countermeasures against timing attacks are known today. However, some 
care must be taken, as they will not all be efficient against our attack. 

First of all, most timing attacks known so far exploited the modular reduction 
occurring in a Montgomery (resp. Barrett, . . . ) multiplication. Therefore, several 
countermeasures [6,13,14] typically consisted in removing the time variation in 
this multiplication. Since our attack is not based on the same property, these 
countermeasures are pointless. 

Another frequently suggested countermeasure consists in randomizing the
exponent by adding a random multiple of $\varphi(n)$ to it, a modification that does
not affect the final result. However, GPS, which was designed for efficiency, uses a
short (typically, between 240 and 304 bits) exponent. Clearly, adding a multiple
of $\varphi(n)$ (typically, 1024 or 2048 bits) will imply a very serious performance
drawback.

Simple sliding window techniques are probably too basic to make the at- 
tack impossible (in short, there is still a correlation between running time and 
Hamming weight). 

Nevertheless, some other countermeasures are possible. 

Pre-computed Commitments: A natural countermeasure is to use pre-computed
commitments. This obviously makes the attack impossible; the advantages
and the drawbacks of such a method have already been discussed in
section 2.2.

Square and Multiply Always: Using dummy multiplications during the
exponentiation hides the Hamming weight of the ephemeral keys. This
countermeasure increases the computation time by about 30%.
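
A minimal Python sketch of the square and multiply always idea is given below. The function name and the operation counter are ours, and the code only illustrates the operation count: Python big-integer arithmetic and the remaining branch are not constant-time, so this is not a hardened implementation.

```python
def square_and_multiply_always(g, e, n, bits):
    """One squaring and one multiplication per exponent bit, whatever its value."""
    result, operations = 1, 0
    for i in reversed(range(bits)):
        result = (result * result) % n
        product = (result * g) % n     # always computed, possibly as a dummy
        operations += 2
        if (e >> i) & 1:
            result = product           # kept only when the bit is 1
    return result, operations

# Exponents with different Hamming weights now cost the same number of operations.
for e in (0b00001000, 0b11111111):
    value, ops = square_and_multiply_always(2, e, 1000003, 8)
    print(bin(e), ops, value == pow(2, e, 1000003))
```

For a random exponent, plain square and multiply needs about 1.5 operations per bit on average against 2 here, i.e. roughly one third more work, which is consistent with the overhead quoted above.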

MIST algorithm: A much more efficient countermeasure could be the use of
Walter’s MIST exponentiation algorithm [15]. MIST is not strictly constant-time
(the greater the exponent, the longer the division chain). However, the
information it could leak is probably limited to the most significant bits of the
exponent [16], which are already known to any eavesdropper (as they correspond
to the most significant bits of the answer). Thus MIST, which was designed to resist
power analysis for RSA-like systems, i.e. when the same exponent is used many
times, seems to fit our needs with GPS, where a new exponent is used for
each commitment.

Remark: It was also suggested to us to modify the pseudo-random number
generator as a possible countermeasure. The most straightforward way to do this is
to take $A = 2^a$ and ensure that the outputs $y \in [0, 2^a[$ of the generator are such
that $w(y) = a/2$. The corresponding information on individual bits of $y$ is null.
However, we believe this countermeasure must be considered with great caution:
the security of identification schemes such as GPS is strongly related to the
randomness of the commitments, and such “randomness modifications” are
always very risky.



7 Conclusion 

We proposed a strategy for a side-channel attack on GPS, and showed that it 
is realistic and efficient in a timing attack context. We believe that several re- 
finements of the attack are still possible to further improve its efficiency. As 
usual with side-channel attacks - that do not target cryptographic primitives 
themselves, but rather specific implementations - this paper does not question 
the security of the GPS primitive. Instead, we showed how straightforward
implementations can easily be broken, and pointed out the question of efficient
countermeasures.

Abstract. We proposed a unified hardware architecture for the two 128-bit block
ciphers AES and Camellia, and evaluated its performance using a 0.13-μm
CMOS standard cell library. S-Boxes are the biggest hardware components in
block ciphers, and sometimes they consume more than half of the design area.
The S-Boxes in AES and Camellia are combinations of affine transformations
and multiplicative inversions on Galois fields. The size of the fields is
the same, but their structures are different. Therefore we converted the basis
between the fields by using isomorphism transformations, and shared the inverter
between AES and Camellia. The affine transformations were also merged by
factoring common terms. In addition to the S-Box sharing, many other components
such as permutation layers and key whitening are also merged. As a result,
compact hardware of 14.9K gates with throughputs of 469 Mbps for AES and
661 Mbps for Camellia was achieved. The hardware synthesized with speed
optimization obtained throughputs of 794 Mbps and 1.12 Gbps for the two algorithms,
respectively, with 24.4K gates.



1 Introduction 

The AES (Advanced Encryption Standard) project [1] for the new US federal standard
block cipher algorithm replacing DES (Data Encryption Standard) [2] was started in
1997, and Rijndael [3] was standardized as FIPS PUB 197 [4] in 2001. Since then,
many block ciphers that have the AES-compatible interface, supporting a 128-bit data
block and 128/192/256-bit keys, have been proposed for other organizations [5-9].
Camellia [9, 10], developed by NTT (Nippon Telegraph and Telephone Corp.) and
Mitsubishi Electric, is one of them. In February 2003, the NESSIE (New European
Schemes for Signatures, Integrity and Encryption) project [6] chose it as a
recommended algorithm and decided to submit it to ISO and IETF.

Camellia has good performance in both software and hardware implementations,
and is a promising alternative to AES. However, supporting multiple algorithms simply
multiplies the hardware costs, while it is not a big issue in software implementation.
The algorithm structures of AES and Camellia are completely different; the former
uses an SPN (Substitution Permutation Network) and the latter a Feistel network.
However, they use very similar basic components, such as multiplicative inversion
on the Galois field $GF(2^8)$ and affine transformation on $GF(2)$. Therefore, it is possible to
reduce hardware cost by reusing these common components between the two algorithms.

In this paper, we first propose a unified S-Box architecture sharing a GF inverter
and merging affine transformations between AES and Camellia, and factoring
techniques for the permutation layers are also shown. Then the entire data path
architecture including a key scheduler is described. Finally, ASIC hardware performance
in size and speed of the proposed architecture is compared with the discrete
implementations of the two algorithms.



2 S-Box Structures 

2.1 AES S-Box 

The AES S-Boxes are combinations of a multiplicative inversion on $GF(2^8)$ and affine
transformations. The irreducible polynomial of Equation (1) is used to define the field.

$m(x) = x^8 + x^4 + x^3 + x + 1$ (1)

Following the inversion, an affine transformation $A$ defined by Equation (2) is
executed in the S-Box for encryption. In the equation, the operator $\oplus$ means XOR
(Exclusive-OR). In the decryption S-Box, the multiplicative inversion follows the inverse
affine transformation $A^{-1}$ defined by Equation (3), which is not shown in the AES
specification [4].

A 128-bit nonlinear function that contains 16 encryption S-Boxes is called SubBytes,
and one that contains 16 decryption S-Boxes is called InvSubBytes. Example hardware
implementations of each S-Box are shown in Fig. 1. An XOR operation with ‘1’ in
Equations (2) and (3) is equivalent to a NOT operation. XOR followed by NOT and XOR
following NOT can be replaced by XNORs (Exclusive NORs). Circuit cost (transistor
counts and operation delay) is basically the same for XOR and XNOR gates, so
hardware performance can be improved by using XNOR instead of XOR with NOT.
However, the circuits shown in Fig. 1 do not use XNOR and are straightforward
implementations of Equations (2) and (3), because they are more suitable as examples.




Fig. 1. Straightforward hardware Implementation of AES S-Boxes 
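
As a software counterpart to the circuit of Fig. 1, the encryption S-Box can be sketched in Python as an inversion in $GF(2^8)$ followed by the affine transformation. The bit-level formula and the constant 0x63 are those of the AES specification (FIPS 197); the function names are ours, and the inversion is done naively by computing $a^{254}$ rather than by any hardware-oriented method.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse as a^254; 0 is mapped to 0 as in AES."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0

def affine(a, constant=0x63):
    """b_i = a_i ^ a_(i+4) ^ a_(i+5) ^ a_(i+6) ^ a_(i+7) ^ c_i, indices mod 8."""
    out = 0
    for i in range(8):
        bit = (a >> i) ^ (a >> ((i + 4) % 8)) ^ (a >> ((i + 5) % 8)) \
            ^ (a >> ((i + 6) % 8)) ^ (a >> ((i + 7) % 8)) ^ (constant >> i)
        out |= (bit & 1) << i
    return out

def aes_sbox(a):
    """Encryption S-Box: inversion followed by the affine transformation."""
    return affine(gf_inv(a))

print(hex(aes_sbox(0x00)), hex(aes_sbox(0x53)))
```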



2.2 Camellia S-Box 

S-Boxes of Camellia use multiplicative inversion on a Galois field and affine
transformations in a fashion similar to AES. The Camellia description [9, 10] only
shows a truth table of the inversion, and its field structure is not clearly described.
So we investigated many fields, and found that the field extended by using the
irreducible polynomials in Equation (4) satisfies the truth table.

$\begin{cases} GF(2^4): & q_0(x) = x^4 + x + 1 \\ GF((2^4)^2): & q_1(x) = x^2 + x + \omega \quad (\omega = (1001)) \end{cases}$ (4)

Two sets of four S-Boxes (S1-S4) are used in every iteration round. In the S-Box S1,
affine transformations F and H defined by Equations (5) and (6) are executed before
and after the multiplicative inversion respectively.

$F: \begin{pmatrix} 0&1&0&0&0&1&0&0 \\ 1&0&0&0&0&0&1&0 \\ 0&0&1&0&1&0&0&1 \\ 0&0&1&0&0&0&0&1 \\ 0&0&0&1&0&0&1&0 \\ 0&1&0&0&1&0&0&0 \\ 1&0&0&0&0&0&0&1 \\ 0&0&0&1&0&1&0&0 \end{pmatrix} \begin{pmatrix} a_0 \oplus 1 \\ a_1 \oplus 1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \oplus 1 \\ a_6 \\ a_7 \oplus 1 \end{pmatrix}$ (5)

[Equation (6), the matrix form of H, did not survive extraction.]

The S-Boxes S2 and S3 are defined as S1 followed by a 1-bit right rotation and a
1-bit left rotation of the output bits $(b_0, \ldots, b_7)$, respectively. For the S-Box S4,
the input bits $(a_0, \ldots, a_7)$ of S1 are rotated. Fig. 2
shows an example circuit of the Camellia S-Boxes. Here too, it is possible to combine
XOR with NOT into XNOR.





Fig. 2. Camellia S-Boxes (panels: Affine Transformation F, Affine Transformation H)



3 Unified S-Box 

3.1 Construct Unified S-Box 

In this section, we propose the shared S-Box architecture where a multiplicative
inverter is reused and affine transformations are merged between SubBytes,
InvSubBytes and S1-S4. Fig. 3 shows the process of S-Box sharing. The right arrows in the
figure are all 8-bit data buses. In Fig. 3 (1), the two S-Boxes SubBytes and InvSubBytes
are independently implemented, and have the same $GF(2^8)$ inverter. In (2), the inverter
is shared between the S-Boxes by switching affine transformations $A$ and $A^{-1}$ using 2:1
selectors. We also want to share the inverter with Camellia, but AES and Camellia use
different Galois fields for their S-Boxes. However, all fields of the same size are
isomorphic, so we map all elements of the AES field to the Camellia composite
field, and use the $GF((2^4)^2)$ inverter for all S-Boxes. It is possible to use the AES
$GF(2^8)$ inverter in the Camellia S-Boxes, but the $GF(2^8)$ inverter generated from a
look-up table is much bigger than the $GF((2^4)^2)$ inverter, where sub-field arithmetic for
compact implementation can be applied. The structure of the $GF((2^4)^2)$ inverter is
detailed in Fig. 4, where the width of all data buses is 4 bits. The box $[x^{-1}]$ shows a
$GF(2^4)$ inverter, and is designed as SOP (Sum of Products) logic in our
implementations.

Fig. 3. Unified S-Box architecture (panel titles: (1) AES S-Boxes, (4) Merge affine and isomorphism functions)




$\delta$ and $\delta^{-1}$ in Equations (7) and (8) are isomorphism functions from $GF(2^8)$ to
$GF((2^4)^2)$ and from $GF((2^4)^2)$ to $GF(2^8)$ respectively. The functions are defined as 8x8
XOR matrices in the same way as the affine transformations used in the S-Boxes. In Fig. 3(3),
the isomorphism functions $\delta$ and $\delta^{-1}$ placed before and after the $GF((2^4)^2)$ inverter
lengthen the critical path. In order to shorten the path, the isomorphism functions $\delta$ and
$\delta^{-1}$ are combined with affine transformations $A$ and $A^{-1}$ in Fig. 3(4). The combined
functions $A^{-1} \times \delta$ and $\delta^{-1} \times A$ defined by Equations (9) and (10) require 43 2-input XOR
gates, while 48 gates are used for $A$ and $A^{-1}$. Therefore, circuit size is slightly reduced.

Comparing the matrices of Equations (7) and (9), and those of Equations (8) and
(10), reveals many common terms (1s in the same columns). Therefore, hardware
size can be reduced by sharing the XOR gates corresponding to these common terms.
However, half of the input bits of Equation (9) are XORed with '1', and thus
these values are inverted before the matrix operation. Therefore, the common
terms for these bits cannot be shared between Equations (7) and (9). To share
these bits, we replace the inversions on the input bits with inversions on the output
bits, as shown in Equation (11). This is possible because these matrix operations are all
linear functions. The AES S-Box circuit after merging the matrices becomes Fig. 3(5).
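
The step of moving the inversions from the inputs to the outputs rests only on linearity over GF(2): M(a XOR c) = M(a) XOR M(c). The following minimal sketch checks this for every 8-bit input. The matrix `M` and the constant `c` are random placeholders chosen for illustration; they are not the matrices of Equations (9) or (11).

```python
import random


def mat_vec(matrix, x):
    """Multiply an 8x8 GF(2) matrix (one 8-bit row mask per output bit) by a byte."""
    out = 0
    for row_index, row_mask in enumerate(matrix):
        parity = bin(row_mask & x).count("1") & 1  # XOR of the selected input bits
        out |= parity << row_index
    return out


random.seed(1)
M = [random.randrange(256) for _ in range(8)]  # placeholder matrix (assumption)
c = 0b00001111                                 # placeholder input inversion mask (assumption)

# Inverting inputs with c is the same as XORing the fixed constant M*c onto the outputs.
output_constant = mat_vec(M, c)
equivalent = all(mat_vec(M, a ^ c) == mat_vec(M, a) ^ output_constant for a in range(256))
print("input inversion can be moved to the outputs:", equivalent)
```

Because the output constant is fixed, it costs only inverters (or XNOR gates) on the affected output bits, while the XOR network itself becomes shareable.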
[Equations (7)-(11): 8x8 binary matrices for δ, δ^-1, A^-1×δ, δ^-1×A, and the output-inverted form of Equation (9); the matrix entries are not legible in this copy.]



Finally, we merged the Camellia affine transformations F and H in the same
manner, and obtained the shared S-Box shown in Fig. 3(6). Before merging, F
defined by Equation (5) is likewise transformed into Equation (12). In the Camellia
S-Boxes S1-S3, only the order of the output bits differs. Therefore, the circuit
shown in Fig. 3(6) can be used for all of them simply by rewiring the output bits. In
the S4 S-Box, on the other hand, the input bits are rotated. Therefore, we use the
affine transformation F' defined by Equation (13) instead of F, where the columns of
the matrix are rotated to the right by one bit.

3.2 Hardware Performance of Unified S-Boxes

In this section, the hardware performance of the unified S-Box is compared with that
of the AES and Camellia S-Boxes implemented independently.

Table 1 shows the number of XOR gates required for each matrix operation
described in the previous section. Common terms are not shared between the matrices
for the numbers under “original matrices”, and are shared for the numbers under “sharing
common terms.” While the original matrices require 102 XOR gates in total, the
number is reduced by more than 40% for the shared S-Boxes (60 XORs or 56 XORs).

The S-Box performance in size and speed is shown in Table 2, where a 0.13-µm
CMOS standard cell library is used. One gate is equivalent to a 2-input NAND gate, and
the speed is estimated under worst-case conditions. The GF((2^4)^2) inverter shown in
Fig. 4 is used in all S-Boxes. Two discrete S-Boxes (SubBytes and InvSubBytes)
shown in Fig. 3(1), and one unified S-Box (SubBytes + InvSubBytes) in Fig. 3(5) are
implemented for AES. The four Camellia S-Boxes all have the same performance, as do
the three shared S-Boxes (AES+S1~S3). As the number of merged S-Boxes increases
(SubBytes with InvSubBytes, then with S1-S4), the critical path becomes longer
because the number of selectors increases. The shared S-Box uses 411-414 gates, which
is almost half of the 816 (= 280 + 280 + 256) gates required for a discrete
implementation of two AES S-Boxes and one Camellia S-Box. In actual use, a discrete
implementation additionally needs a 3:1 selector to switch between the three S-Boxes.
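
The percentages quoted above can be rechecked with a few lines of arithmetic. This is only a sanity check on the figures reported in Tables 1 and 2; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# XOR counts of the original (unshared) matrices, as listed in Table 1.
original_xors = {
    "delta": 20, "delta_inv": 21, "Ainv_x_delta": 22,
    "deltainv_x_A": 21, "F": 9, "H": 9,
}
total_original = sum(original_xors.values())

# XOR counts after sharing common terms (AES+S1~S3 and AES+S4).
shared_xors = {"AES+S1~S3": 60, "AES+S4": 56}
xor_reduction = {k: 1 - v / total_original for k, v in shared_xors.items()}

# Gate counts from Table 2: SubBytes + InvSubBytes + one Camellia S-Box.
discrete_gates = 280 + 280 + 256
unified_gates = {"AES+S1~S3": 411, "AES+S4": 414}
gate_ratio = {k: v / discrete_gates for k, v in unified_gates.items()}

print(total_original, discrete_gates)
for name in shared_xors:
    print(f"{name}: XORs reduced by {xor_reduction[name]:.1%}, "
          f"gates at {gate_ratio[name]:.1%} of the discrete total")
```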

Table 1. Numbers of XOR gates required for each matrix operation

                  Original matrices                       Sharing common terms
  δ     δ^-1   A^-1×δ   δ^-1×A   F     H     Total        AES+S1~S3    AES+S4
  20    21     22       21       9     9     102          60           56


Table 2. ASIC performance of each S-Box circuit

  S-Box type                      Gate counts    Delay (ns)
  AES             SubBytes        280            3.65
                  InvSubBytes     280            3.56
                  Merged          349            3.99
  Camellia        S1~S4           256            3.45
  AES+Camellia    AES+S1~S3       411            4.29
                  AES+S4          414            4.65

(0.13-µm CMOS, 1 gate = 2-input NAND, worst condition)



4 Unified Permutation Layer 

AES uses the permutation layers MixColumns and InvMixColumns in encryption and
decryption, respectively. MixColumns and InvMixColumns are inverse functions of each
other, and each function is defined as four 4-byte (4x4x8 bits = 128 bits) matrix
operations. Camellia, on the other hand, has only one permutation layer, called the
P-function, which is a single 8-byte (8x8 bits = 64 bits) matrix operation. To merge
these functions, we compose two 8-byte matrices by gathering two 4-byte MixColumns and
two 4-byte InvMixColumns, respectively, and factorize them into a few 8-byte matrices as
shown in Equations (14) and (15). Multiplications by the constant values {8, 4, 3, 2,
1} in the matrices are performed modulo m(x) of Equation (1). Comparing Equation (14)
for MixColumns with Equation (15) for InvMixColumns shows that MixColumns
is completely included in InvMixColumns [11]. Equation (16) is the matrix
representation of the P-function, whose elements are '0' and '1', so no modular
arithmetic is required. By breaking the P-function into two matrices, many terms in
common with MixColumns are found. After the factorization and common-term sharing, the
permutation functions are represented as Equations (17)~(20). The basic structure of
the shared permutation circuit is shown in Fig. 5.




Fig. 5. Unified permutation circuit (outputs: InvMixColumns, MixColumns, P-function)

[Equations (14)-(16): 8-byte matrix forms of MixColumns, InvMixColumns, and the P-function; the matrix entries are not legible in this copy.]

    y_0 = E_0 + Z_0          y_4 = E_2 + Z_4
    y_1 = E_1 + Z_1          y_5 = E_3 + Z_5
    y_2 = E_0 + Z_2          y_6 = E_2 + Z_6
    y_3 = E_1 + Z_3          y_7 = E_3 + Z_7                    (17)

    Z_0 = 2A_0 + A_2 + x_1   Z_4 = 2A_4 + B_0
    Z_1 = 2A_1 + A_3 + x_2   Z_5 = 2A_5 + B_1
    Z_2 = 2A_2 + A_0 + x_3   Z_6 = 2A_6 + B_2
    Z_3 = 2A_3 + A_1 + x_0   Z_7 = 2A_7 + B_3                   (18)

    w_0 = A_2 + B_0 + x_0    w_4 = A_0 + B_0
    w_1 = A_3 + B_1 + x_1    w_5 = A_1 + B_1
    w_2 = A_0 + B_2 + x_2    w_6 = A_2 + B_2
    w_3 = A_1 + B_3 + x_3    w_7 = A_3 + B_3                    (19)

    A_0 = x_0 + x_1          A_4 = x_4 + x_5          B_0 = A_6 + x_5
    A_1 = x_1 + x_2          A_5 = x_5 + x_6          B_1 = A_7 + x_6
    A_2 = x_2 + x_3          A_6 = x_6 + x_7          B_2 = A_4 + x_7
    A_3 = x_3 + x_0          A_7 = x_7 + x_4          B_3 = A_5 + x_4

    C_0 = 4(x_0 + x_2)       D_0 = 2(C_0 + C_1)       E_0 = D_0 + C_0
    C_1 = 4(x_1 + x_3)       D_1 = 2(C_2 + C_3)       E_1 = D_0 + C_1
    C_2 = 4(x_4 + x_6)                                E_2 = D_1 + C_2
    C_3 = 4(x_5 + x_7)                                E_3 = D_1 + C_3    (20)




Table 3 lists the hardware size and the number of XOR-gate stages on the critical
path for the permutation functions. The total size of our unified permutation circuit
is only 476 XORs, while the original functions require 1,482 XORs. Therefore, the
hardware cost is reduced to less than 1/3 at the cost of only 2 additional XOR-gate delays.
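
As a cross-check on the common-term sharing, the sketch below computes two MixColumns, two InvMixColumns, and the Camellia P-function both directly from their textbook matrices and from one set of shared terms. The shared terms (A, B, C, D, E) follow one consistent reading of Equations (17)-(20); the mapping of x_0..x_7 onto the P-function input bytes, and all function names, are assumptions made for this illustration.

```python
import random


def xtime(b):
    """Multiply by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gf_mul(a, b):
    """General GF(2^8) multiplication, used only by the reference model."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
INV_MIX = [[14, 11, 13, 9], [9, 14, 11, 13], [13, 9, 14, 11], [11, 13, 9, 14]]
# Camellia P-function: input byte indices (0-based) XORed into each output byte.
P_ROWS = [(0, 2, 3, 5, 6, 7), (0, 1, 3, 4, 6, 7), (0, 1, 2, 4, 5, 7), (1, 2, 3, 4, 5, 6),
          (0, 1, 5, 6, 7), (1, 2, 4, 6, 7), (2, 3, 4, 5, 7), (0, 3, 4, 5, 6)]


def mat_mul(matrix, column):
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


def reference(x):
    """Direct evaluation: (InvMixColumns, MixColumns, P-function) on 8 bytes."""
    y = mat_mul(INV_MIX, x[:4]) + mat_mul(INV_MIX, x[4:])
    z = mat_mul(MIX, x[:4]) + mat_mul(MIX, x[4:])
    w = []
    for indices in P_ROWS:
        acc = 0
        for i in indices:
            acc ^= x[i]
        w.append(acc)
    return y, z, w


def shared(x):
    """Same three functions built from shared intermediate terms."""
    A = [x[0] ^ x[1], x[1] ^ x[2], x[2] ^ x[3], x[3] ^ x[0],
         x[4] ^ x[5], x[5] ^ x[6], x[6] ^ x[7], x[7] ^ x[4]]
    B = [A[6] ^ x[5], A[7] ^ x[6], A[4] ^ x[7], A[5] ^ x[4]]
    C = [xtime(xtime(v)) for v in (x[0] ^ x[2], x[1] ^ x[3], x[4] ^ x[6], x[5] ^ x[7])]
    D = [xtime(C[0] ^ C[1]), xtime(C[2] ^ C[3])]
    E = [D[0] ^ C[0], D[0] ^ C[1], D[1] ^ C[2], D[1] ^ C[3]]
    z = [xtime(A[i]) ^ A[(i + 2) % 4] ^ x[(i + 1) % 4] for i in range(4)]
    z += [xtime(A[4 + i]) ^ B[i] for i in range(4)]
    y = [E[0] ^ z[0], E[1] ^ z[1], E[0] ^ z[2], E[1] ^ z[3],
         E[2] ^ z[4], E[3] ^ z[5], E[2] ^ z[6], E[3] ^ z[7]]
    w = [A[(i + 2) % 4] ^ B[i] ^ x[i] for i in range(4)] + [A[i] ^ B[i] for i in range(4)]
    return y, z, w


random.seed(2003)
samples = [[random.randrange(256) for _ in range(8)] for _ in range(1000)]
print("shared terms match the reference:", all(shared(s) == reference(s) for s in samples))
```

Only multiplications by {02} appear in the shared version, which is what lets the InvMixColumns part reuse the MixColumns outputs Z_i and lets the P-function reuse the A and B terms.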



Table 3. Performance of permutation functions

                              Original matrices                         Sharing
                   2 MixColumns   2 InvMixColumns   P-func   Total      common terms
  XORs             304            880               288      1,482      476
  Delay (gates)    3              5                 3        5          7




5 Unified Data Path Architecture 

Fig. 6 shows the unified data path architecture of the data randomization block. In
addition to sharing S-Boxes and permutation, the FL/FL^-1 and key whitening functions
are also merged. Only 128-bit keys are supported in the current design, but 192- and
256-bit keys can easily be supported by modifying the key scheduler shown in Fig. 7.
Many components are shared in the data randomization block, but only registers can be
reused in the key scheduler, because the key scheduling methods of the two algorithms
are very different.

128 bits are processed in one clock cycle by FL/FL^-1 and key whitening in Camellia
and by the first AddRoundKey (using the key whitening path) in AES. The other function
blocks handle 64 bits at a time. A straightforward implementation of AES with a
128-bit data path takes 11 cycles for one block encryption or decryption. The
unified hardware of Fig. 6, however, processes 64 bits per cycle for rounds 2-11, and
thus takes 20 cycles for these rounds. The key scheduler, which reuses the S-Boxes in
the main data path, requires an additional 10 cycles. Therefore, our unified hardware
takes 31 (= 1 + 20 + 10) cycles for one AES block, as listed in Table 4.

6 ASIC Performance Comparison

Table 4 shows the performance comparison between our unified hardware and
independent implementations of the two algorithms [11, 12], where the same 0.13-µm CMOS
standard cell library is used for all designs. Two circuits were generated from each
design source by specifying area and speed optimizations to the synthesis tool. The
number of S-Boxes is eight (64 bits) in all designs.

In comparison with the AES hardware of reference [11], the unified hardware takes
one cycle fewer. As mentioned before, this is because a 128-bit block is processed at
once in the first AddRoundKey, whereas in [11] it is processed 64 bits at a time and
takes two cycles. On the other hand, the Camellia operation takes 22 cycles in the
unified hardware, compared with 18 cycles in [12], because the Camellia hardware in
[12] executes the FL/FL^-1 functions or the key whitening in the same cycle as the F
function. This approach is suitable for high-speed implementation, but requires
additional hardware. Therefore, we did not use it for the unified hardware, whose
priority is compactness. The maximum operating frequency of the unified hardware is
lower than that of references [11, 12]. This is because the critical path became
longer due to additional hardware, such as the selectors that merge the two data
paths of the different algorithms.

Discrete implementations require 21.6K (= 8.0K + 13.6K) gates for the compact versions
of the two algorithms and 34.6K (= 14.8K + 19.8K) gates for the high-speed versions. In
contrast, our unified hardware is about 30% smaller, at 14.9K gates and 24.4K gates,
respectively. The throughputs of the unified hardware are 9-14% lower for AES and
31-40% lower for Camellia. Therefore, the proposed architecture is well suited to
applications such as embedded use, where hardware resources are more critical than
speed.
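
The throughput figures in Table 4 are consistent with throughput = 128 bits × maximum frequency / cycles per block. The short sketch below reproduces them from the reported frequencies and cycle counts; the function and variable names are illustrative only.

```python
BLOCK_BITS = 128


def throughput_mbps(freq_mhz, cycles_per_block):
    """Throughput in Mbps for one 128-bit block every `cycles_per_block` cycles."""
    return BLOCK_BITS * freq_mhz / cycles_per_block


# AES on the unified hardware: first AddRoundKey + 10 rounds at 2 cycles + key schedule.
aes_cycles = 1 + 2 * 10 + 10
camellia_cycles = 22

rows = [
    ("this work, area, AES", 113.64, aes_cycles),
    ("this work, area, Camellia", 113.64, camellia_cycles),
    ("this work, speed, AES", 192.31, aes_cycles),
    ("this work, speed, Camellia", 192.31, camellia_cycles),
    ("ref. [11], area, AES", 137.17, 32),
    ("ref. [12], area, Camellia", 153.85, 18),
]
for label, freq, cycles in rows:
    print(f"{label}: {throughput_mbps(freq, cycles):.2f} Mbps")
```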



Table 4. Hardware performance comparison

                   Algorithm   Cycles   Gate      Max. frequency   Throughput   Synthesis
                                        counts    (MHz)            (Mbps)       optimization
  This work        AES         31       14,918    113.64           469.22       Area
                   Camellia    22                                  661.18
                   AES         31       24,424    192.31           794.05       Speed
                   Camellia    22                                  1,118.89
  Reference [11]   AES         32       7,998     137.17           548.68       Area
                                        14,777    218.82           875.28       Speed
  Reference [12]   Camellia    18       13,557    153.85           1,094.04     Area
                                        19,783    227.27           1,616.14     Speed

(0.13-µm CMOS, 1 gate = 2-input NAND, worst condition)







A. Satoh and S. Morioka 



7 Conclusion 

A unified hardware architecture for the 128-bit block ciphers AES and Camellia was
proposed, and its performance was evaluated in comparison with non-unified
implementations. To merge the S-Box, the biggest hardware component, between the two
algorithms, a multiplicative inverter on GF((2^4)^2) was shared by using an isomorphism
transformation, and a factoring technique was applied to the affine transformations. The
permutation layers were also merged by sharing common terms of the operator matrices.
Our architecture was synthesized using a 0.13-µm CMOS standard cell library, and
compact implementations of 14.9K~24.4K gates were obtained with throughputs of
469~794 Mbps and 661~1,119 Mbps for AES and Camellia, respectively. The gate
counts were 30% smaller than those of conventional implementations in which the two
algorithms were designed separately.

Appendix 1 AES Algorithm

Fig. A1 shows an AES encryption process under a 128-bit secret key. Eleven sets of round
keys are generated from the secret key, and are fed to each round of the SPN block.
The round operation is a combination of four primitive functions: SubBytes (sixteen
8-bit S-Boxes), ShiftRows (byte boundary rotations), MixColumns (4-byte × 4-byte
matrix operation), and AddRoundKey (bit-wise XOR). In decryption, the inverse
functions (AddRoundKey is its own inverse) are executed in reverse order.

The key scheduler uses four S-Boxes and 4-byte constant values Rcon[i] (i = 1~10).
The highest byte of Rcon[i] is the bit representation of the polynomial x^(i-1) mod m(x),
and the other three bytes are all zeros. In decryption, these sets of keys are used in
reverse order.
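
For illustration, the round constants can be generated by repeated multiplication by x modulo m(x) = x^8 + x^4 + x^3 + x + 1. This is a minimal sketch; the function names are chosen here and are not taken from the paper.

```python
def xtime(b):
    """Multiply a GF(2^8) element by x modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def rcon_words(count=10):
    """Return Rcon[1..count] as 32-bit words: x^(i-1) in the top byte, zeros elsewhere."""
    words, value = [], 0x01
    for _ in range(count):
        words.append(value << 24)
        value = xtime(value)
    return words


print([f"{w:08x}" for w in rcon_words()])
```

The reduction by m(x) first matters at i = 9, where x^8 wraps around to {1b}.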




Fig. A1. Encryption process of AES algorithm (panels: SPN Rounds, Key Scheduling)



Appendix 2 Camellia Algorithm 

Fig. A2 shows the encryption process of Camellia for a 128-bit secret key. At the
initial and final stages, 128-bit data is XORed with 128-bit round keys. A 22-round
data randomization part consists of three 6-round Feistel networks, and two FL/FL^-1
functions placed between the networks. The 128-bit data input to the Feistel network
is divided into two 64-bit data blocks; the left half is fed into the F function with a
64-bit round key, and its output is XORed with the right half. The left and right halves
are swapped every round. The 64-bit data input to the F function is XORed with the 64-bit
round key. The result is divided into eight 8-bit blocks, which are fed to eight
S-Boxes (S1-S4) followed by the P-function. The same data path can be used in decryption
by just changing the order of the round keys.

As shown in Fig. A3, a 128-bit intermediate key K_A is generated from the 128-bit
secret key K_L by using the F function 4 times. The round keys are generated from K_L
and K_A by bit rotations.




Fig. A2. Encryption process of Camellia for a 128-bit key 



Fig. A3. Intermediate key generating process for a 128-bit key

Abstract. In this paper, a compact FPGA architecture for the AES algorithm
with a 128-bit key, targeted at low-cost embedded applications, is
presented. Encryption, decryption and key schedule are all implemented 
using small resources of only 222 Slices and 3 Block RAMs. This im- 
plementation easily fits in a low-cost Xilinx Spartan II XC2S30 FPGA. 
This implementation can encrypt and decrypt data streams of 150 Mbps, 
which satisfies the needs of most embedded applications, including wire- 
less communication. Specific features of Spartan II FPGAs enabling com- 
pact logic implementation are explored, and a new way of implementing 
MixColumns and InvMixColumns transformations using shared logic re- 
sources is presented. 



1 Introduction 

The National Institute of Standards and Technology (NIST) selected the Rijndael
algorithm as the new Advanced Encryption Standard (AES) [29] in 2001.
Numerous FPGA [2] [15] [16] [17] [18] [19] [20] [24] [25] [26] [27] [28] and ASIC
[4] [6] [7] [8] [10] [11] implementations of the AES were previously proposed and
evaluated. To date, most implementations feature high speeds and high costs
suitable for high-end applications only.

The need for secure electronic data exchange will become increasingly more 
important. Therefore, the AES must be extended to low-end customer products, 
such as PDAs, wireless devices, and many other embedded applications. In order 
to achieve this goal, the AES implementations must become very inexpensive. 

Most low-end applications do not require high encryption speeds. Current
wireless networks achieve speeds up to 60 Mbps. Implementing security
protocols, even for those low network speeds, significantly increases the
requirements for computational power. For example, the processing power requirements
for AES encryption at the speed of 10 Mbps are at the level of 206.3 MIPS [12].
In contrast, a state-of-the-art handset processor is capable of delivering
approximately 150 MIPS at 133 MHz, and 235 MIPS at 206 MHz.

This paper attempts to create a bridge between performance and cost require- 
ments of the embedded applications. As a result, a low-cost AES implementation 
for FPGA devices, capable of supporting most of the embedded applications, was 
developed and evaluated. 

2 Related Work



Early AES designs were mostly straightforward implementations of various loop 
unrolled and pipelined architectures [24] [25] [26] [27] [28] with limited number 
of architectural optimizations, which resulted in poor resource utilization. For 
example, AES 8x8 S-boxes were implemented on LUTs as huge tables left for 
synthesizers to optimize. 

Later FPGA implementations demonstrate better utilization of FPGA re- 
sources. Several architectures using dedicated on-chip memories implementing 
S-boxes and T-boxes were developed [15] [17] [18] [19] [20]. 

Recent research focused on fast pipelined implementations in both the FPGA [2]
[3] [14] [18] [19] [20] and ASIC [4] [6] [7] [11] worlds. Unfortunately, most of those
implementations are too costly for practical applications.

The first significant step in compacting the AES implementation was made
when V. Rijmen proposed an AES S-box implementation based on composite
fields [31]. A similar solution was proposed by J. Wolkerstorfer [13]. Rijmen's idea
has already been implemented in FPGA [2], and in ASICs [4] [6] [8]. S. Morioka
et al. [10] went even further and proposed a low-power compact S-box design
suited for ASIC designs.



3 Architecture of the Compact Implementation 

We began the design of the compact architecture by analyzing the basic archi- 
tecture, as introduced in [26]. The basic architecture unrolls only one full cipher 
round, and iteratively loops data through this round until the entire encryption 
or decryption transformation is completed. Only one block of data is processed 
at a time making it equally suited for feedback and non-feedback modes of op- 
eration. 

The structure of the AES round for encryption is shown in Fig. 1. The decryption
round looks very similar, except that it employs the inverse operations in the
following order: InvShiftRows, InvSubBytes, AddRoundKey and InvMixColumns.
The SubBytes and ShiftRows operations in Fig. 1 are reordered compared to the
cipher round depicted in the standard [29]. Their order is not significant because
SubBytes operates on single bytes, and ShiftRows reorders bytes without altering
them. This feature was used in our implementation.
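
The commutation argument holds for any byte-wise substitution and any byte permutation, which the following minimal sketch checks. The table used here is a random stand-in rather than the real AES S-box, and the state is assumed to be stored column by column (byte index = 4*column + row).

```python
import random


def shift_rows(state):
    """ShiftRows on a 16-byte state stored column by column: row r rotates left by r."""
    return [state[4 * ((col + row) % 4) + row] for col in range(4) for row in range(4)]


def sub_bytes(state, table):
    """Apply a byte-wise substitution table to every byte of the state."""
    return [table[b] for b in state]


random.seed(0)
table = list(range(256))
random.shuffle(table)  # stand-in substitution table (assumption, not the AES S-box)
state = [random.randrange(256) for _ in range(16)]

commutes = sub_bytes(shift_rows(state), table) == shift_rows(sub_bytes(state, table))
print("SubBytes and ShiftRows commute:", commutes)
```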

The AES round shown in Fig. 1 reveals a great deal of parallelism. The 
data bytes are ordered from the most significant (byte 0) to the least significant 
(byte F) assuming big-endian representation. The round is composed of 16 8-bit 
S-boxes computing SubBytes, and four 32-bit MixColumns operations, working 
independent of each other. The only operation that spans throughout the entire 
128-bit block is ShiftRows. 

It is possible to implement only four SubBytes and one MixColumns in order
to compact the AES implementation. Ideally, the resources should be cut by a
factor of four, while execution of one round should take four clock cycles. This
approach would result in approximately four times lower performance than for the
basic architecture.

Fig. 1. Operations within AES encryption round

Cutting the resources by 75% may not appear easy. The folded round, as we 
call the modified round, still must transform 128 bits, and storage for all 128 bits 
of the data block must exist. Another complication is related to the implemen- 
tation of the ShiftRows operation. The data bytes processed in the AES round 
cannot return to the same positions in the block register because it would not 
execute the ShiftRows operation. On the other hand, those same bytes cannot 
be placed into locations indicated by ShiftRows because those locations are oc- 
cupied by other bytes that have not yet been processed. Therefore, additional 
bits of intermediate results must be stored, and more logic resources are needed. 

One of the possible architectures for a folded implementation is shown in
Fig. 2a. This architecture requires one 128-bit register, one 96-bit register and
one 32-bit wide 4-to-1 multiplexer on top of the main cipher operations. The
multiplexer becomes even bigger when both ShiftRows and InvShiftRows are
implemented using the same logic resources. The execution of one round takes four
clock cycles. We believe that this, or a very similar architecture, was implemented
by A. Satoh et al. [23], but we cannot be sure since the authors do not provide
enough detail. Their results show that the 4-cycle round takes 50% of the
resources required by the 1-cycle round, and yields four times lower throughput.

Another possible architecture is shown in Fig. 2b. The 96-bit register is im- 
plemented as three 32-bit registers inserted into round operations creating a 
pipeline. In the case of FPGAs, those 32-bit registers will most likely be placed 
in the same Slices as logic operations yielding better resource utilization. The 
critical path is also shortened which permits the execution at a higher clock rate; 
however, the execution of the entire round requires seven, instead of four, clock 
cycles. We believe that this architecture was implemented by S. McMillan et al. 
[21], but again, we cannot be certain since the authors did not provide enough 
detail. S. McMillan et al. reported only slight difference of 48 Slices (16%), and 
large difference of 24 Block RAMs (75%), between 1-round unrolled and folded 
architecture. 

Fig. 2. Folded architecture: a) by A. Satoh et al. [23]; b) by S. McMillan et al. [21]



3.1 Implementation of a Folded Register 

The two folded architectures described above are very straightforward and re- 
sulted in small logic savings. In order to create a folded architecture with better 
parameters, we decided to explore fine details of FPGA devices. We arranged 
data bytes into rows as shown in Fig. 3. This data arrangement is consistent 
with a state introduced in [30]. The following exercise can now be executed in 
steps: 

1. Read input bytes: 0, 5, A, F; execute SubBytes, MixColumns and AddRoundKey
on them; write results to the output at locations: 0, 1, 2, 3. This step
is highlighted in Fig. 3.

2. Repeat above operations for input bytes: 4, 9, E, 3; write results at output 
locations: 4, 5, 6, 7. 

3. Repeat above operations for bytes: 8, D, 2, 7; write results at locations: 8, 
9, A, B. 

4. Repeat above operations for bytes: C, 1, 6, B; write results at locations: C, 
D, E, F. Output now becomes input for the next step. 

In those four steps the entire AES round was executed including ShiftRows 
operation. At each step only one byte was read from each input row, and one 
byte was written to each output row. A similar exercise with identical conclu- 
sions can be executed for decryption transformation. Each row can be viewed 
as an addressable 8-bit wide memory. The correct execution of ShiftRows and 
InvShiftRows is now resolved to the proper addressing of each of the memories at 
the consecutive clock cycles. At the fourth clock cycle output memories become 
input memories and vice versa. 
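
The read pattern of the four steps above can be written as a small address generator. The sketch below assumes the state bytes are numbered column by column (byte index = 4*column + row), which matches the byte lists in the steps; the decryption offsets follow the standard InvShiftRows definition and are not spelled out in the text.

```python
def read_addresses(step, decrypt=False):
    """Input byte indices consumed at a given step (0..3) of the folded round.

    Row r is read from column (step + r) mod 4 for encryption (ShiftRows) and
    from column (step - r) mod 4 for decryption (InvShiftRows).
    """
    sign = -1 if decrypt else 1
    return [4 * ((step + sign * row) % 4) + row for row in range(4)]


for step in range(4):
    enc = " ".join(f"{a:X}" for a in read_addresses(step))
    dec = " ".join(f"{a:X}" for a in read_addresses(step, decrypt=True))
    print(f"step {step + 1}: encrypt reads {enc} | decrypt reads {dec}")
```

In each step exactly one byte is taken from each row, so each row can be served by its own 8-bit wide memory with an independent read address.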



Dual-Port RAM Based Implementation. Each CLB Slice in a Spartan II
FPGA contains two look-up tables (LUTs), which are the primary resources for
logic implementation. Typically, LUTs are configured as small 16x1 ROM tables
implementing logic functions of up to four inputs; however, other configurations
are also possible. Two LUTs within the same Slice can implement a 16x1
Dual-Port RAM. An 8-bit wide Dual-Port RAM can be implemented using eight
CLB Slices. This memory can be divided into two banks, each addressed by a
different port. One port is used for reading data from the memory, while the
other writes results back to the same memory. The switching between
banks can be achieved by flipping one address bit in both ports every fourth clock
cycle.

Fig. 3. Data arrangement in the folded architecture. Data bytes involved in the first
step of calculation are highlighted

The Dual-Port RAM based solution has major advantages over solutions 
presented in Fig. 2: 

- The logic resources required for storing intermediate results are far smaller.

- The multiplexer used before for ShiftRows and InvShiftRows is no longer
needed.

- The complicated routing resulting from implementation of ShiftRows and
InvShiftRows is avoided, yielding better performance.



Shift Register Based Implementation. A better solution may result from
the following observation: all bytes from the output of AddRoundKey are written
into consecutive locations in the output memory in consecutive clock cycles.
Therefore, we could use a simple shift register to shift computed data in without
generating any addresses. Fortunately, LUTs can also be configured as 16-bit
shift registers with variable taps, as shown in Fig. 4. Four Slices can implement
an 8-bit wide, 16-bit long shift register. The input of the shift register is used
for shifting results in, while the output, selected dynamically by changing the tap
address, is used for reading data out. This solution retains all of the advantages
of the Dual-Port RAM based solution, and requires less than half of the logic
resources of the Dual-Port RAM.
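
The idea can be modelled in software as four 16-deep shift registers, one per state row, where results are shifted in without a write address and only the read tap changes per cycle. This is a behavioural sketch under stated assumptions: the class, the tap formula, and the read-before-shift timing are illustrative choices, not the paper's circuit, and the round transformation is replaced by an identity so that only the byte reordering is visible.

```python
from collections import deque


class ShiftRegister16:
    """Toy model of a 16-deep, 8-bit wide shift register with a selectable read tap."""

    def __init__(self):
        self.cells = deque([0] * 16, maxlen=16)

    def shift_in(self, value):
        self.cells.appendleft(value)  # tap 0 always holds the newest entry

    def read(self, tap):
        return self.cells[tap]


def folded_round(state, transform=lambda column: column):
    """Run one 4-step folded round; `transform` stands in for SubBytes/MixColumns/AddRoundKey."""
    rows = [ShiftRegister16() for _ in range(4)]
    for col in range(4):                      # load the state, one column per cycle
        for row in range(4):
            rows[row].shift_in(state[4 * col + row])
    for step in range(4):
        # Row r needs the byte from column (step + r) mod 4; `+ step` compensates
        # for the results already shifted in during earlier steps.
        taps = [3 - (step + row) % 4 + step for row in range(4)]
        column = [rows[row].read(taps[row]) for row in range(4)]
        result = transform(column)
        for row in range(4):                  # results go in with no write address
            rows[row].shift_in(result[row])
    return [rows[row].read(3 - col) for col in range(4) for row in range(4)]


print(folded_round(list(range(16))))
```

With the identity transform, the output is exactly the ShiftRows permutation of the input, which shows that the tap addresses alone implement ShiftRows.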




Fig. 4. Look-Up Table (LUT) configured as a shift register 



3.2 Implementation of the SubBytes and InvSubBytes 

Various area-efficient implementations of AES S-boxes were proposed in [2] [4]
[6] [8] [10] [13] [22] [23] [31]. All of those implementations are based on the
idea of transforming the original GF(2^8) field into a composite of smaller fields
GF((2^4)^2). It is a very attractive solution, especially from the perspective of
an ASIC, because its implementation occupies a smaller area than a ROM. In
the case of FPGAs, S-boxes can be mapped into dedicated Block RAMs treated
as ROMs, or into LUTs. The latter approach could utilize the idea of composite
fields. We decided to keep a good balance between utilization of LUTs and
Block RAMs for the entire design, and implemented our S-boxes on dedicated
Block RAMs.

Each Block RAM represents a dual-port memory of 4096 bits. Each port can 
be independently configured for different width and depth [34]. We selected a 
512x8 configuration for each port, which provides access to the same memory 
space in the same way from both ports. A single SubBytes or InvSubBytes im- 
plementation requires a 256x8 ROM. A Block RAM has enough space to imple- 
ment both SubBytes and InvSubBytes, as shown in Fig. 5. Each port has access 
to the entire memory space, and can perform a SubBytes or InvSubBytes trans- 
formation independently of each other. The folded architecture requires only 
2 Block RAMs to implement four SubBytes and four InvSubBytes operations all 
together. 
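
A software model of one such 512x8 ROM is sketched below. The S-box is computed from its definition (inversion in GF(2^8) followed by the affine transformation) rather than copied from a table. Placing SubBytes in the lower 256 entries and InvSubBytes in the upper 256, with the ninth address bit acting as an encrypt/decrypt select, is an assumption made for illustration.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Multiplicative inverse as a^254; maps 0 to 0, as the AES S-box requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sub_byte(x):
    y = gf_inv(x)
    return y ^ rotl8(y, 1) ^ rotl8(y, 2) ^ rotl8(y, 3) ^ rotl8(y, 4) ^ 0x63


SBOX = [sub_byte(x) for x in range(256)]
INV_SBOX = [0] * 256
for x, s in enumerate(SBOX):
    INV_SBOX[s] = x

# 512x8 ROM image: address bit 8 selects the table (layout is an assumption).
ROM = SBOX + INV_SBOX


def rom_lookup(byte, decrypt):
    """Model one Block RAM port: 9-bit address = {decrypt, byte}."""
    return ROM[(int(decrypt) << 8) | byte]


print(hex(rom_lookup(0x53, False)), hex(rom_lookup(0xED, True)))
```

Since each of the two ports can address all 512 entries independently, one Block RAM serves two byte lanes, which is consistent with two Block RAMs covering four lanes.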

The Block RAM is a fully synchronous memory. Reading from it requires
supplying the address one clock cycle before the data appears at the output.
This feature can be viewed as a pipeline stage introducing a delay of one clock
cycle. Execution of the entire round in such a circuit would take five clock cycles;
however, a simple modification can be applied to maintain the execution rate at
four clock cycles per round. The trick is based on the fact that the
folded register, described in Section 3.1, does not transform data bytes in any
way other than reordering them. Therefore, this stage can be safely skipped
if necessary. It appears that forwarding only one byte from the input of the
folded register to the input of the S-boxes is sufficient to maintain the execution rate
of four clock cycles per round. Unfortunately, different bytes are forwarded in
the case of encryption and in the case of decryption, as shown in Fig. 7.

Fig. 5. Block RAM based implementation of SubBytes and InvSubBytes

3.3 Implementation of the MixColumns and InvMixColumns 

The 32-bit input to the MixColumns transformation is represented as a polynomial
of the form a(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0, with coefficients in GF(2^8). The
coefficients of a(x) are themselves polynomials of the form
b(x) = b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0,
with their own coefficients in GF(2).

MixColumns multiplies the input polynomial by a constant polynomial

    c(x) = {03}x^3 + {01}x^2 + {01}x + {02}                      (1)

modulo x^4 + 1. The coefficients in GF(2^8) are multiplied modulo
x^8 + x^4 + x^3 + x + 1. InvMixColumns multiplies the input polynomial by another
constant polynomial:

    d(x) = c^-1(x) = {0b}x^3 + {0d}x^2 + {09}x + {0e}            (2)

The implementation of MixColumns is very simple because the coefficients
of c(x) are small. InvMixColumns, on the other hand, is far more
complex and occupies a larger area.

A. Satoh et al. [23] proposed an implementation based on the following idea:

    d(x) = c(x) + e(x) + f(x)                                    (3)

where

    e(x) = {08}x^3 + {08}x^2 + {08}x + {08}                      (4)

    f(x) = {04}x^2 + {04}                                        (5)

This implementation yields logic optimizations since InvMixColumns shares logic
resources with MixColumns.

We propose a different method for exploring resource sharing. Our implementation
is derived as follows:

    c(x) • d(x) = {01}                                           (6)

If we multiply both sides of equation (6) by d(x) we obtain:

    c(x) • d^2(x) = d(x)                                         (7)

where

    d^2(x) = {04}x^2 + {05}                                      (8)

Note that two of the coefficients of d^2(x) are equal to {00}.

MixColumns and InvMixColumns can be implemented using shared logic
resources as shown in Fig. 6.
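
The identities (3) and (6)-(8) can be checked numerically by multiplying polynomials with GF(2^8) coefficients modulo x^4 + 1. In this minimal sketch a polynomial is a list of four coefficients indexed by the power of x; the helper names are chosen for the example, and the test column is the commonly used MixColumns example vector.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def poly_mul(p, q):
    """Multiply two 4-term polynomials over GF(2^8) modulo x^4 + 1."""
    r = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            r[(i + j) % 4] ^= gf_mul(p[i], q[j])
    return r


# Coefficient lists are ordered [x^0, x^1, x^2, x^3].
c = [0x02, 0x01, 0x01, 0x03]
d = [0x0E, 0x09, 0x0D, 0x0B]
e = [0x08, 0x08, 0x08, 0x08]
f = [0x04, 0x00, 0x04, 0x00]
d_squared = poly_mul(d, d)

print("c*d   =", poly_mul(c, d))              # expected: the constant {01}
print("d^2   =", [hex(v) for v in d_squared])
print("c*d^2 == d:", poly_mul(c, d_squared) == d)

# InvMixColumns as "multiply by d^2, then reuse MixColumns".
column = [0xDB, 0x13, 0x53, 0x45]
mixed = poly_mul(c, column)
restored = poly_mul(c, poly_mul(d_squared, mixed))
print("round trip ok:", restored == column)
```

Because d^2(x) has only two nonzero coefficients, the extra stage in front of MixColumns needs only multiplications by {04} and {05}.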




Fig. 6. Implementation of MixColumns and InvMixColumns (outputs: InvMixColumns, MixColumns)



Multiplication by {04} and {05} leads to the following equations:

    b(x) • {04} = b_5 x^7 + b_4 x^6 + (b_7 + b_3)x^5 + (b_7 + b_6 + b_2)x^4
                + (b_6 + b_1)x^3 + (b_7 + b_0)x^2 + (b_7 + b_6)x + b_6                  (9)

    b(x) • {05} = (b_7 + b_5)x^7 + (b_6 + b_4)x^6 + (b_7 + b_5 + b_3)x^5
                + (b_7 + b_6 + b_4 + b_2)x^4 + (b_6 + b_3 + b_1)x^3
                + (b_7 + b_2 + b_0)x^2 + (b_7 + b_6 + b_1)x + (b_6 + b_0)               (10)

Their implementation appears area efficient since 4-input XOR gates are the
widest gates involved in the computations, and they are efficiently implemented in
the 4-input LUTs of the FPGA.
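
The bit-level form of the multiplication by {04} can be derived by applying the reduction modulo x^8 + x^4 + x^3 + x + 1 twice, and the {05} case is the same network with b(x) added. The sketch below writes both as XOR networks and compares them with a generic GF(2^8) multiplier for all 256 inputs; the function names are illustrative.

```python
def gf_mul(a, b):
    """Reference multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def times_04(b):
    """b(x) * {04} as a pure XOR network on the bits b_7..b_0."""
    b0, b1, b2, b3, b4, b5, b6, b7 = [(b >> i) & 1 for i in range(8)]
    out_bits = [
        b6,            # x^0
        b7 ^ b6,       # x^1
        b7 ^ b0,       # x^2
        b6 ^ b1,       # x^3
        b7 ^ b6 ^ b2,  # x^4
        b7 ^ b3,       # x^5
        b4,            # x^6
        b5,            # x^7
    ]
    return sum(bit << i for i, bit in enumerate(out_bits))


def times_05(b):
    """b(x) * {05} = b(x) * {04} + b(x); the widest term is a 4-input XOR."""
    return times_04(b) ^ b


ok_04 = all(times_04(b) == gf_mul(b, 0x04) for b in range(256))
ok_05 = all(times_05(b) == gf_mul(b, 0x05) for b in range(256))
print("x04 network matches:", ok_04, "| x05 network matches:", ok_05)
```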

At the time this paper was written we learned that this technique was first
discovered and proposed for software implementations by P. Barreto [5].
V. Fischer and F. Gramain were the first to apply it in hardware [1].

3.4 Encryption/Decryption Unit

Our circuit is capable of performing encryption and decryption. The AES en- 
cryption and decryption rounds substantially differ from the point of view of 
hardware implementations. One of the inconveniences arises from the fact that 
the AddRoundKey is executed after MixColumns in the case of encryption, and 
before InvMixColumns in the case of decryption. Therefore, a switching logic 
is required to select appropriate data paths, which affects the performance, as 
shown in Fig. 7. 




Fig. 7. Implementation of the encryption/decryption unit 



It is possible to reorder the InvMixColumns and AddRoundKey and avoid
some of the switching. In this case, the key schedule would need to perform
an additional InvMixColumns transformation on most of the subkeys. The
InvMixColumns requires much more area than the switching logic. Our
implementation delivers sufficient performance with the switching logic in
place; therefore, we implemented the architecture shown in Fig. 7.

3.5 Implementation of the Key Schedule 

The key schedule is typically implemented using one of two methods: computing
keys on-the-fly for every block of encrypted data, or precomputing them in
advance and storing them. Computing keys on-the-fly has the obvious advantage
that keys can be changed quickly, with little or no delay. This performance
comes at the price of increased power consumption, as the key schedule is
recomputed for each data block.

In the case of the AES it is easy to perform key schedule transformations in
the forward direction, and this is the order in which the round keys are applied
during encryption. During decryption, round keys are applied in the reverse
order. The key schedule could compute round keys in the backward direction, but
only by starting from the last round key, not the main key. Unfortunately, the
last round key can be obtained from the main key only by first computing the
entire key schedule in the forward direction. For this reason, a key schedule
that computes keys on-the-fly completely loses its advantage when decryption is
performed.

Our AES implementation is designed to perform both encryption and decryption.
Since we did not see any advantage in computing round keys on-the-fly, we
chose to implement a key schedule that precomputes all round keys. The
implementation of the key schedule is shown in Fig. 8. It computes 32 bits of
key material per clock cycle; therefore, a full key schedule execution takes 44
clock cycles. The computed round keys are stored in a single Block RAM.
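As a software illustration of the two directions discussed above, the sketch below expands a 128-bit key into its 44 words (11 round keys) and then walks backward from the last round key to the main key. It is a model under stated assumptions, not the authors' circuit: all function names are hypothetical, the S-box is generated from its standard definition, and the test key is the example key from the AES specification.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def build_sbox():
    """S-box = affine transform of the multiplicative inverse (0 maps to 0)."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):  # XOR with the inverse rotated left by 1..4 bits
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def xor(u, v):
    return [p ^ q for p, q in zip(u, v)]

def g(word, rc):
    """RotWord, then SubWord, then add the round constant to the first byte."""
    out = [SBOX[word[(i + 1) % 4]] for i in range(4)]
    out[0] ^= rc
    return out

def next_round_key(k, rc):
    """Forward step: round key i -> round key i+1 (k is four 4-byte words)."""
    w0 = xor(k[0], g(k[3], rc))
    w1 = xor(k[1], w0)
    w2 = xor(k[2], w1)
    w3 = xor(k[3], w2)
    return [w0, w1, w2, w3]

def prev_round_key(k, rc):
    """Backward step: round key i+1 -> round key i."""
    p3 = xor(k[3], k[2])
    p2 = xor(k[2], k[1])
    p1 = xor(k[1], k[0])
    p0 = xor(k[0], g(p3, rc))
    return [p0, p1, p2, p3]

def expand(main_key):
    """Return all 11 round keys (44 words) from a 16-byte main key."""
    keys = [[list(main_key[4 * i:4 * i + 4]) for i in range(4)]]
    for rc in RCON:
        keys.append(next_round_key(keys[-1], rc))
    return keys

def to_hex(k):
    return "".join(f"{b:02x}" for w in k for b in w)
```

The backward step needs only the current round key, but the starting point (the last round key) is reachable only through the full forward pass, which is the reason given above for precomputing and storing all round keys.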




Fig. 8. Implementation of the key schedule 



The key schedule uses a SubBytes operation that is identical to the one used
in the encryption circuit. Since the key schedule does not work simultaneously
with the encryption unit, it is possible to time-share S-boxes between both
circuits. This approach saves two Block RAMs at the expense of additional
switching logic and degraded performance. The performance is affected by the
presence of the switching logic in the critical path, and by slightly more
complicated floorplanning and routing, as encryption/decryption and key schedule units
are no longer separated. We implemented the switching logic using tri-state 
buffers in order to minimize its influence on the overall performance; however, 
this solution may not be the most desired for various reliability and testability 
related reasons. In the case when tri-state buffers are not allowed in the design, 
a multiplexer should be used for switching. 

4 Targeted Device, Synthesis, and Implementation 
Results 

The goal for this design was to create a low-cost FPGA implementation of AES
targeted at real-life applications. Much of the previous research targets
state-of-the-art technologies, overlooking the fact that those devices cost
hundreds of US dollars each. We shifted our attention to older technologies
and smaller devices. Xilinx Inc. produces two low-cost families of devices,
Spartan II and Spartan IIE. Pricing for Spartan II FPGAs starts from less than
$10 per unit [35].

Spartan II FPGAs are manufactured in a 0.22 µm CMOS process. Their architecture
is derived from the larger Virtex family. Spartan IIE devices are based on the
newer VirtexE family and are manufactured in a 0.18 µm CMOS process. The
smallest device from the Spartan IIE family was too large for our needs. The
device we selected for our implementation is the Spartan II XC2S30, the second
smallest in its family.

The synthesis of our design was done using Synplify Pro 7.2 from Synplicity.
We set the target clock frequency constraint to 60 MHz and the fanout guide to
100, and enabled resource sharing. We performed synthesis for speed grades -5
and -6.

The mapping, placing and routing were done using the Xilinx ISE 5.2i package.
The mapper optimized the circuit for area, and the router worked with effort
level 5.

The results are given in Table 1. The maximum frequencies come from static
timing analysis only. The performance is affected nearly equally by logic and
routing: the routing vs. logic delay ratio in the critical path is 54/46. Better
results could be demonstrated with manual floorplanning.

Table 1. Implementation results

Device      Area                        Max. clock         Throughput
            CLB Slices   Block RAMs     frequency [MHz]    [Mbps]
XC2S30-5    222          3              50                 139
XC2S30-6    222          3              60                 166
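The throughput and clock figures in Table 1 can be related by a simple model. The sketch below assumes that throughput equals the block size (128 bits) times the clock frequency divided by the number of clock cycles spent per block; this formula and the resulting cycle count are inferences from the table, not values stated in this section, and the function names are illustrative.

```python
def throughput_mbps(clock_mhz, cycles_per_block, block_bits=128):
    """Throughput in Mbps for one block processed every cycles_per_block cycles."""
    return block_bits * clock_mhz / cycles_per_block

def implied_cycles_per_block(clock_mhz, mbps, block_bits=128):
    """Cycles per block implied by a reported clock frequency and throughput."""
    return block_bits * clock_mhz / mbps

for device, clock_mhz, mbps in [("XC2S30-5", 50, 139), ("XC2S30-6", 60, 166)]:
    cycles = implied_cycles_per_block(clock_mhz, mbps)
    print(f"{device}: about {cycles:.1f} clock cycles per 128-bit block")
```

Under this assumption both speed grades imply roughly 46 clock cycles per block, which is compatible with a folded design that spends several cycles on each round plus a few cycles of overhead.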



5 Comparison with Other Designs 

Despite our intensive search we encountered surprisingly few compact
implementations of the AES algorithm in FPGAs. There exist commercial compact
cores from Amphion [32] and Helion [33]. Both companies provide compact cores
in encryption-only or decryption-only versions, with a 128-bit key schedule. We
also encountered a JBits implementation by S. McMillan et al. [21]. Their
implementation uses JBits to tailor the bitstream for a particular key and for
either encryption or decryption. Therefore, encryption and decryption are never
simultaneously present in the circuit, and the key schedule is not implemented
in hardware.

We also collected information about other existing architectures capable of
encrypting or decrypting data in feedback modes of operation [15] [16] [17] [24]
[26]. We did not take into account any implementations based on T-boxes as
they give greater throughput at the expense of much larger area. The basic
features of all the implementations are collected in Table 2, and their
performance characteristics in Table 3.

Table 2. Basic features of compared architectures

                                                                  Key Schedule
Design                    Device         Encryption  Decryption   128   192   256

0.22 µm
Our                       Spartan II-6       •           •         •     -     -
P. Chodowiec et al. [15]  Virtex-6           •           •         •     •     •
A. Dandalis et al. [24]   Virtex-6           •           •         •     -     -
A.J. Elbirt et al. [16]   Virtex-6           •           -         -     -     -
V. Fischer et al. [17]    FLEX 10KE-1        •           •         -     -     -
                          ACEX 1K-1          •           •         -     -     -
K. Gaj et al. [26]        Virtex-6           •           •         -     -     -
S. McMillan et al. [21]   Virtex             •           -         -     -     -

0.18 µm
Amphion CS5220XV [32]     VirtexE-8          •           -         •     -     -
Amphion CS5230XV [32]     VirtexE-8          •           -         •     •     •
Helion compact [33]       Spartan IIE-6      •           -         •     -     -
Helion fast [33]          VirtexE-8          •           •         •     -     -
V. Fischer et al. [17]    APEX 20KE-1        •           •         -     -     -

0.15 µm
Amphion CS5220XV [32]     Virtex2-5          •           -         •     -     -
Amphion CS5230XV [32]     Virtex2-5          •           -         •     •     •
Helion fast [33]          Virtex2-5          •           •         •     -     -







Among compact architectures, our design is one of the smallest and offers 
richer functionality than cores from Amphion and Helion because it supports 
both encryption and decryption. Both commercial cores are faster than ours; 
however, they are implemented in a better, thus more expensive technology. The 
implementation by S. McMillan et al. is also very compact and fast; however, it 
benefits from the JBits application which is not likely to work in an embedded 
environment. 

We notice large differences among the results for the basic architecture. The
implementation by P. Chodowiec et al. offers the most complete functionality
and is nearly identical in size to the fast Helion core. The implementations by
V. Fischer et al. on FLEX and APEX have similar parameters, but do not include
a key schedule. The Amphion core CS5230XV is the smallest implementation in the
basic architecture, but does not support decryption.

Relating the results for our compact implementation to the implementations
in the basic architecture, we can see that the goal of reducing the required logic
resources by 75% was achieved. Moreover, the throughput of our design is higher
than 25% of the best throughput reported for the basic architecture in the
same technology.

Table 3. Performance of all compared cores

                          Area                          Throughput   Clock cycles
Design                    CLB Slices      Block RAMs    [Mbps]       per round

0.22 µm
Our                       222             3             166          4
P. Chodowiec et al.       ~1230           18            577          1
A. Dandalis et al.        5673            0             353          1
A.J. Elbirt et al.        3528            0             294.2        1
V. Fischer et al. FLEX    2530 LE (a)     24 EAB (b)    451          1
V. Fischer et al. ACEX    2923 LE (a)     12 EAB (b)    212          1
K. Gaj et al.             2902            0             331.5        1
S. McMillan et al.        240             8             250          7

0.18 µm
Amphion CS5220XV          421             4             294          4
Amphion CS5230XV          573             10            1061         1
V. Fischer et al. APEX    2493 LE (a)     50 ESB (b)    612          1
Helion compact            392 LUT (a)     3             223          4
Helion fast               2259 LUT (a)    18            1001         1

0.15 µm
Amphion CS5220XV          403             4             350          4
Amphion CS5230XV          573             10            1323         1
Helion fast               2259 LUT (a)    18            1408         1

(a) 2 LE ≈ 2 LUT ≈ 1 Slice    (b) 1 EAB = 2 ESB = 1 BRAM
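The 75% area-reduction goal and the 25% throughput comparison can be checked against the 0.22 µm rows of Table 3. The sketch below assumes that the reference "basic architecture" is the design by P. Chodowiec et al. (the highest throughput in that group) and that only CLB slices are counted, ignoring Block RAMs; both choices are assumptions made for illustration.

```python
# Values read from Table 3 (0.22 um group).
OURS_SLICES, OURS_MBPS = 222, 166
REF_SLICES, REF_MBPS = 1230, 577   # assumed reference: P. Chodowiec et al.

# Fraction of logic resources saved relative to the reference design.
area_reduction = 1 - OURS_SLICES / REF_SLICES

# Our throughput as a fraction of the reference throughput.
throughput_fraction = OURS_MBPS / REF_MBPS

print(f"slice reduction: {area_reduction:.1%}")
print(f"throughput relative to reference: {throughput_fraction:.1%}")
```

With these inputs the slice count drops by roughly 82% while the throughput stays at roughly 29% of the reference, consistent with both statements in the text.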



We intentionally did not provide Throughput/Area ratios for any of the
compared designs as this measure can be very misleading when dedicated memories
are present in the design.

6 Conclusions 

In this paper the feasibility of creating a very compact, low-cost FPGA imple- 
mentation of the AES was examined. The proposed folded architecture achieves 
good performance and occupies less area than previously reported designs. This 
compact design was developed by thorough examination of each of the com- 
ponents of the AES algorithm and matching them into the architecture of the 
FPGA. 

The demonstrated implementation fits in a very inexpensive, off-the-shelf
Xilinx Spartan II XC2S30 FPGA, whose cost starts below $10 per unit. Only
50% of the logic resources available in this device were utilized, leaving enough
area for additional glue logic. This implementation can encrypt and decrypt
data streams at up to 166 Mbps. The encryption speed, functionality, and cost
make this solution perfectly practical in the world of embedded systems and
wireless communication.


Abstract. Performance evaluation of the Advanced Encryption Standard
candidates has led to intensive study of both hardware and software
implementations. However, although many papers present various
implementation results, it seems that efficiency could still be greatly
improved by applying good design rules adapted to devices and algorithms.
This paper addresses various approaches for efficient FPGA implementations
of the Advanced Encryption Standard algorithm. As different applications
of the AES algorithm may require different speed/area tradeoffs,
we propose a rigorous study of the possible implementation schemes, but
also discuss design methodology and algorithmic optimization in order
to improve previously reported results. We propose heuristics to evaluate
hardware efficiency at different steps of the design process. We also
define an optimal pipeline that takes the place and route constraints into
account. The resulting circuits significantly improve on previously reported
results: throughput is up to 18.5 Gbits/sec and area requirements can be
limited to 542 slices and 10 RAM blocks, with a throughput/area ratio
improved by at least 25% over the best-known designs in the Xilinx
Virtex-E technology.



1 Introduction 

In October 2000, NIST (National Institute of Standards and Technology) selected
Rijndael [2] as the new Advanced Encryption Standard. The selection process
included performance evaluation on both software and hardware platforms.
Many hardware architectures were proposed [3-16], but most of them were
straightforward implementations of the Rijndael specification. More recently,
design strategies and implementation approaches were proposed for the
implementation of block ciphers in reconfigurable hardware [17,18], while other
papers focused on some interesting algorithmic optimizations, especially for the
highly expensive substitution box of Rijndael [19,20,21]. This paper addresses
various approaches for FPGA implementations of the Advanced Encryption Standard
algorithm and combines recent observations about Rijndael into efficient designs.
As different applications of the AES algorithm may require different speed/area
tradeoffs, we propose a rigorous study of the possible implementation schemes,
but also discuss design methodology and algorithmic optimization in order to
improve previously reported results. We first discuss the implementation of the
substitution box and linear diffusion layer at the algorithmic level. Then we
examine different possible architectures and optimizations. Finally, we present
heuristics for evaluating the efficiency of our architectures at different steps
of the design process. Synthesis and implementation constraints of FPGAs are
taken into account in order to define the maximum and optimal pipelines. We apply
these notions to loop and unrolled architectures in order to improve circuit
performance and compare our results to the best designs reported in the
literature. The main contribution of this paper lies in the improvement of
hardware efficiency, which we define as the ratio throughput/area: the efficiency
of the best-known unrolled architectures is improved by 35%, while the efficiency
of the best-known loop architectures is improved by at least 25% in the Xilinx
Virtex-E technology.

This paper is structured as follows. The description of the hardware, synthesis 
tool and implementation tool is in section 2. Section 3 gives a short mathemati- 
cal description of Rijndael and we propose an efficient representation of the key 
schedule by means of a key round. The main contribution of this paper lies in 
section 4 where we discuss the possible implementation tradeoffs. Section 4.1 
deals with design methodology and defines hardware efficiency and maximum 
pipeline for FPGAs. Section 4.2 presents possible algorithmic optimization of 
Rijndael. Different schemes for the substitution box are proposed and the dif- 
fusion layer is combined with the key addition. Section 4.3 proposes different 
architectures for various speed/area tradeoffs: loop architectures and unrolled 
architectures are studied and implemented. Finally, section 4.4 defines optimal 
pipeline for FPGAs as well as a heuristic rule to reach it. Practical results and 
comparisons with best known published designs are in section 5 and conclusions 
are in section 6. 



2 Hardware Description 

All our implementations were carried out on a XILINX VIRTEX3200ECG1156-8
FPGA. We chose this technology in order to allow relevant comparisons with
the best-known FPGA implementations of Rijndael. In this section, we briefly 
describe the structure of a VIRTEX FPGA as well as the synthesis and imple- 
mentation tools that were used to obtain our results. 

Configurable Logic Blocks (CLB’s): The basic building block of the VIRTEX
logic block is the logic cell (LC). An LC includes a 4-input function generator,
carry logic and a storage element. The output from the function generator in
each LC drives both the CLB output and the D input of the flip-flop. Each
VIRTEX CLB contains four LC’s, organized in two similar slices. Figure 1 shows
a detailed view of a single slice. Virtex function generators are implemented
as 4-input look-up tables (LUTs). In addition to operating as a function
generator, each LUT can provide a 16 x 1-bit synchronous RAM. Furthermore, the
two LUTs within a slice can be combined to create a 16 x 2-bit or 32 x 1-bit
synchronous RAM or a 16 x 1-bit dual port synchronous RAM. The VIRTEX LUT
can also provide a 16-bit shift register.

Fig. 1. The VIRTEX slice.

The storage elements in the VIRTEX slice can be configured either as edge- 
triggered D-type flip-flops or as level-sensitive latches. The D inputs can be 
driven either by the function generators within the slice or directly from slice 
inputs, bypassing function generators. 

The F5 multiplexer in each slice combines the function generator outputs. This 
combination provides either a function generator that can implement any 5-input 
function, a 4:1 multiplexer, or selected functions of up to nine bits. Similarly, the 
F6 multiplexer combines the outputs of all four function generators in the CLB 
by selecting one of the F5-multiplexer outputs. This permits the implementation 
of any 6-input function, an 8:1 multiplexer, or selected functions up to 19 bits. 
The arithmetic logic also includes an XOR gate that allows a 1-bit full adder to
be implemented within an LC. In addition, a dedicated AND gate improves the 
efficiency of multiplier implementations. 

Finally, VIRTEX FPGAs incorporate several large RAM blocks. These comple- 
ment the distributed LUT implementations of RAM’s. Every block is a fully 
synchronous dual-ported 4096-bit RAM with independent control signals for 
each port. The data widths of the two ports can be configured independently.

Our hardware: A VIRTEX3200ECG1156-8 FPGA contains 32448 slices and
208 RAM blocks, which means 64896 LUTs and 64896 flip-flops. In the next
sections, we compare the number of LUTs, registers and slices. We also evaluate
the delays and frequencies using our synthesis tool. The synthesis was
performed with FPGA Compiler 2 3.7.1 (SYNOPSYS) and the implementation
with XILINX ISE-5. Finally, our circuit models were described using VHDL.



3 Block Cipher Description 

Rijndael is an iterated block cipher that operates on a 128-bit cipher state and
uses a 128-bit key¹. It consists of a series of 10 applications of a key-dependent
round transformation to the cipher state. In the following, we individually
define the component mappings and constants that build up Rijndael, then
specify the complete cipher in terms of these components.

Representation: The state and key are represented as a square array of 16
bytes. This array has 4 rows and 4 columns. It can also be seen as a vector in
$GF(2^8)^{16}$. Let $s$ be a cipher state or a key $\in GF(2^8)^{16}$; then $s_i$ is the
$i$-th byte of the state $s$ and $s_i(j)$ is the $j$-th bit of this byte.

SubBytes, the non-linear layer γ: The SubBytes transformation is a non-linear
byte substitution, operating on each byte independently. The substitution
table (or s-box) is invertible and is constructed by the composition of two
operations:

1. The multiplicative inverse in $GF(2^8)$.

2. An affine transform over $GF(2)$.

Every byte is therefore considered as a polynomial with coefficients in GF(2):

$$b(x) = b_7x^7 + b_6x^6 + b_5x^5 + b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0 \qquad (1)$$

Then SubBytes consists of the parallel application of this s-box S:

$$\gamma(a) = b \Leftrightarrow b_i = S(a_i), \quad 0 \le i \le 15 \qquad (2)$$
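The two-step construction of the s-box can be sketched in software. The example below is illustrative only: the function names are hypothetical, and the affine matrix and the constant 0x63 are taken from the AES specification rather than from this section. The inverse is computed as a^254, which maps 0 to 0 as the s-box definition requires.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(a):
    """Multiplicative inverse in GF(2^8), computed as a^254 (0 maps to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def affine(x):
    """Affine transform over GF(2): output bit i is
    x_i ^ x_(i+4) ^ x_(i+5) ^ x_(i+6) ^ x_(i+7) ^ c_i, indices mod 8, c = 0x63."""
    out = 0
    for i in range(8):
        bit = (x >> i) & 1
        for k in (4, 5, 6, 7):
            bit ^= (x >> ((i + k) % 8)) & 1
        bit ^= (0x63 >> i) & 1
        out |= bit << i
    return out

# The s-box S is the composition of the two operations.
SBOX = [affine(gf_inv(a)) for a in range(256)]

def sub_bytes(state):
    """gamma: apply S independently to each of the 16 state bytes."""
    return [SBOX[a] for a in state]
```

Because every byte is handled independently, the 16 table look-ups in `sub_bytes` correspond to 16 parallel s-box instances in hardware.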

The ShiftRows transformation δ: In ShiftRows, the rows of the state are
cyclically shifted over different offsets. Row 0 is not shifted, row 1 is shifted over
1 byte, row 2 over 2 bytes and row 3 over 3 bytes.
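A short sketch of ShiftRows follows. It assumes the byte ordering of the AES specification, in which byte i of the state sits in row i mod 4 and column i div 4, and a cyclic shift to the left; this section does not state the ordering explicitly, so both are assumptions, and the function names are illustrative.

```python
def shift_rows(state):
    """Row r of the 4x4 state is rotated left by r positions.
    state[r + 4*c] holds the byte in row r, column c (assumed ordering)."""
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def inv_shift_rows(state):
    """Inverse permutation: row r is rotated right by r positions."""
    return [state[r + 4 * ((c - r) % 4)] for c in range(4) for r in range(4)]
```

Since the operation is a fixed permutation of byte positions, it reduces to wiring in hardware.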

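The MixColumns transformation θ: In MixColumns, the columns of the
state are considered as polynomials over $GF(2^8)$ and multiplied modulo $x^4 + 1$
with a fixed polynomial c(x), given by:

$$c(x) = \texttt{'03'}x^3 + \texttt{'01'}x^2 + \texttt{'01'}x + \texttt{'02'} \qquad (3)$$

The polynomial is coprime to $x^4 + 1$ and therefore is invertible. This can be
written as a matrix multiplication:

$$\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} =
\begin{pmatrix}
\texttt{'02'} & \texttt{'03'} & \texttt{'01'} & \texttt{'01'} \\
\texttt{'01'} & \texttt{'02'} & \texttt{'03'} & \texttt{'01'} \\
\texttt{'01'} & \texttt{'01'} & \texttt{'02'} & \texttt{'03'} \\
\texttt{'03'} & \texttt{'01'} & \texttt{'01'} & \texttt{'02'}
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}$$

where $(b_3, b_2, b_1, b_0)$ is a four-byte column of the state. An output byte of
MixColumns (for example $b_0$) can be expressed as:

$$b_0 = \texttt{'02'} \times a_0 \oplus \texttt{'03'} \times a_1 \oplus \texttt{'01'} \times a_2 \oplus \texttt{'01'} \times a_3$$

¹ Actually, there exist several versions of Rijndael with different block and key lengths,
but we focus on this one.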

We also define a function X, corresponding to the multiplication with ’02’
modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$;
$X : GF(2^8) \to GF(2^8)$, $X(a) = b \Leftrightarrow$

b(7) = a(6)

b(6) = a(5)

b(5) = a(4)

b(4) = a(3) ⊕ a(7)

b(3) = a(2) ⊕ a(7)

b(2) = a(1)

b(1) = a(0) ⊕ a(7)

b(0) = a(7)
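The function X and the MixColumns output byte can be modeled as follows. This is a sketch with hypothetical function names: `xtime_bits` writes X bit by bit, with a(7) folded into bits 1, 3 and 4 and copied to bit 0 as dictated by the reduction polynomial m(x), and `mix_column` applies the same row pattern cyclically to obtain the four output bytes. The test column is the MixColumns example from the AES specification.

```python
def xtime(a):
    """Multiplication by '02' in GF(2^8): shift left, reduce by m(x) if needed."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a

def xtime_bits(a):
    """Bit-level form of X: five output bits are shifted input bits,
    three receive an extra EXOR with a(7)."""
    t = lambda i: (a >> i) & 1
    b = [t(7), t(0) ^ t(7), t(1), t(2) ^ t(7),
         t(3) ^ t(7), t(4), t(5), t(6)]          # b(0) .. b(7)
    return sum(v << i for i, v in enumerate(b))

def mix_b0(a0, a1, a2, a3):
    """b0 = '02'*a0 + '03'*a1 + a2 + a3, using '03'*a = X(a) + a."""
    return xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3

def mix_column(col):
    """All four output bytes: row i uses the inputs rotated by i positions."""
    return [mix_b0(*(col[(i + j) % 4] for j in range(4))) for i in range(4)]
```

Writing '03' × a as X(a) ⊕ a means the whole transformation needs only the X function and bitwise EXORs.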

The round key addition σ[K]: In this operation, a round key is applied to
the state by a simple bitwise EXOR. The round key is derived from the cipher
key by means of the key schedule. The round key length is equal to the block
length.

$$\sigma[k](a) = b \Leftrightarrow b_i = a_i \oplus k_i, \quad 0 \le i \le 15 \qquad (4)$$

The round transformation ρ[K]: The round transformation can be written
as a composition of the four previous transformations:

$$\rho[K] = \sigma[K] \circ \theta \circ \delta \circ \gamma = \sigma[K](\theta(\delta(\gamma))) \qquad (5)$$

The key schedule: The round keys are derived from the cipher key by means
of the key schedule. This consists of two transformations: the key expansion and
the round key selection. In our description, SubWord (SW) is a function that
takes a 4-byte input word and returns a word in which each byte is the result of
applying the Rijndael s-box to the corresponding input byte. The function
RotWord (RW) returns a word in which the bytes are a cyclic permutation of
those in its input, such that the input word (a, b, c, d) produces the output
word (b, c, d, a). Finally, RC(i) is an 8-bit round constant for the round i.

The key schedule can be easily described by the use of a key round β that takes
four 4-byte input words, corresponding to a 128-bit key, and produces four 4-byte
output words. The first round key $K_0$ is the cipher key; then we have:

$$K_{i+1} = \beta(K_i), \quad i = 0, \ldots, 9 \qquad (6)$$

Figure 2 illustrates the key round of Rijndael.

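The complete cipher: Rijndael is defined for the cipher key K as the
transformation $Rijndael[K] = \alpha[K_0, K_1, \ldots, K_{10}]$ applied to the plaintext where:

$$\alpha[K_0, K_1, \ldots, K_{10}] = \sigma[K_{10}] \circ \delta \circ \gamma \circ \left( \bigcirc_{r=1}^{9} \rho[K_r] \right) \circ \sigma[K_0] \qquad (7)$$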

Our implementations are based on this description of AES Rijndael. 



4 Implementation Tradeoffs 

The optimization methods and the resulting implementation tradeoffs for the im- 
plementation of AES Rijndael can be divided into two classes: architectural and 
algorithmic optimization. Algorithmic optimization exploits algorithmic strength 
inside each round unit. Architectural optimization exploits design techniques 
such as pipelining, loop unrolling and sub-pipelining. 

This paper first considers loop architectures, where only a small number
m (typically m = 1) of rounds are independently implemented in hardware.
Loop architectures enable small-area circuits but have low throughput. Then
we improve the throughput at the cost of increased area by combining
loop unrolling and pipelining. Unrolled architectures have a large number
m of rounds (typically all) that are independently implemented in hardware.
Pipelining increases the encryption speed by processing multiple blocks of data
simultaneously. It is achieved by inserting rows of registers among the
combinatorial logic. The parts of logic between two consecutive registers form
pipeline stages. In the case of block ciphers, each round constitutes a pipeline
stage. Finally, sub-pipelining is similar to pipelining but also inserts
registers inside the round functions.

Concerning algorithmic optimizations, we focused on the critical parts of
Rijndael. Different schemes for the substitution box are proposed and compared.
We also underline interesting combinations of the MixColumns θ with the key
addition σ[K]. In this section, we propose a rigorous study of the possible
tradeoffs for implementing Rijndael.

4.1 Design Methodology

In [18], a methodology to implement block ciphers in reconfigurable hardware is
presented, based on simple digital design rules applied to iterated block ciphers.
Looking at the round functions of iterated block ciphers, it is observed that
they are mainly built on simple algebraic or logic operations. Therefore, the
sub-pipelining of round functions is mandatory if efficient designs are wanted.
In practice, the designer can easily keep the critical path inside one CLB slice.
Moreover, looking at the CLB structure, it can be seen that FPGAs involve specific
constraints that have to be taken into account if an optimal design is wanted.
As the slice of Figure 1 is divided into logic elements and storage elements,
an efficient implementation will be the result of a good compromise between
the combinatorial logic used, the sequential logic used and the resulting
performance. These observations lead to different definitions of implementation
efficiency:

1. In terms of performance, let the efficiency of a block cipher be the ratio
Throughput (Mbits/s) / Area (slices).

2. In terms of resources, the efficiency is easily tested by computing the ratio
Nbr of LUTs / Nbr of registers: it should be close to one.

Our implementations of Rijndael were designed in order to maximize these
notions of hardware efficiency. In practice, this results in the sub-pipelining
of every component of the round functions. The next section studies algorithmic
optimizations combined with good sub-pipelining. For this purpose, we define the
maximum pipeline as the pipeline whose number of stages brings the ratio
Nbr of LUTs / Nbr of registers closest to one (while remaining lower than one).

4.2 Algorithmic Optimizations: A First Tradeoff 

A. Implementing the substitution box: The Rijndael S-box is a non-linear
byte substitution used 200 times in Rijndael with 128-bit block length
and key length. It is invertible and is constructed by the composition of two
transformations:

1. The mapping $x \to x^{-1}$, where $x^{-1}$ represents the multiplicative inverse in
the field $GF(2^8)$.

2. An affine transformation over $GF(2)$: $x \to Ax + b$, where A and b are
constants.

In terms of hardware resources, the substitution box is the most expensive part 
of Rijndael. As a consequence, its implementation is a critical part in the design 
of an efficient encryption core. This transform can be implemented following 
different schemes. We propose to observe three possibilities and the resulting 
constraints. 

A1. The multiplexor model: A first and obvious solution is to consider
SubBytes as a large multiplexor and take advantage of special FPGA
configurations to implement it. Figure 3 illustrates the implementation of an
output bit of the Rijndael s-box. We pipelined γ by inserting two register levels
so that the critical path corresponds to one 4-input LUT, one multiplexor F5 and
one multiplexor F6. Table 1 summarizes the synthesis results for the non-linear
transform γ where the s-box is repeated 16 times.

Fig. 3. The substitution box γ.

Table 1. Synthesis of the non-linear layer γ.

Component   Nbr of LUT         Nbr of registers
γ           144 x 16 = 2304    42 x 16 = 672



A2. RAM-based implementation: Another possibility is to use the RAM
blocks available inside the VIRTEX to implement substitution boxes. The
resulting SubBytes transform uses 8 RAM blocks and is performed in one clock
cycle.



A3. Composite field solution: In AES Rijndael, every byte represents an
element in the finite field $GF(2^8)$. It can also be represented as a polynomial of
degree at most 7 over $GF(2)$: $b_7x^7 + b_6x^6 + b_5x^5 + b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0$.
Addition and subtraction of polynomials are given by the sum modulo 2 of the
coefficients of both terms (bitwise XOR). Multiplication in $GF(2^8)$ corresponds
to multiplication of polynomials modulo an irreducible binary polynomial of
degree 8. Rijndael uses $m(x) = x^8 + x^4 + x^3 + x + 1$. As the irreducible polynomial
is used to construct the field and there are different irreducible polynomials
of degree 8, several finite fields can be considered and generate different
representations of Rijndael. These fields are isomorphic, which means that there
is a one-to-one mapping from one representation of Rijndael to another. Finally,
the multiplicative inverse of a polynomial b(x) is defined such that:

$$b(x) \cdot b^{-1}(x) = 1 \bmod m(x) \qquad (8)$$

In [19], subfield arithmetic is used to propose efficient implementations of
Galois Field arithmetic, especially in the context of the Rijndael block cipher.
Computations in the field $GF(2^8)$ are replaced by computations in the composite
field $GF(2^4)^2$ in order to reduce the size of the tables needed for the inversion.
Basically, the idea is to consider our polynomial of degree at most 7 over $GF(2)$
as a polynomial of degree at most 1 over $GF(2^4)$, say $a_1x + a_0$, where
$a_0, a_1 \in GF(2^4)$. The multiplicative inverse of $a_1x + a_0$ is computed in the
field $GF(2^4)^2$ as a polynomial $b_1x + b_0$ such that:

$$(a_1x + a_0) \times (b_1x + b_0) = 1 \bmod P(x) \qquad (9)$$

where P(x) is an irreducible polynomial and the coefficients $b_0$, $b_1$ can be
expressed as follows:

$$b_1 = a_1 \cdot (a_0^2 + a_1a_0 + a_1^2\lambda)^{-1}$$

$$b_0 = (a_0 + a_1) \cdot (a_0^2 + a_1a_0 + a_1^2\lambda)^{-1} \qquad (10)$$
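The inversion formulas of equation (10) can be checked exhaustively in software. The sketch below makes several assumptions that are not taken from [19]: GF(2^4) is built with the field polynomial x^4 + x + 1, P(x) is taken to have the form x^2 + x + λ, and λ is simply the first value for which P(x) has no root in GF(2^4). All function names are illustrative.

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf16_inv(a):
    """Inverse of a nonzero element of GF(2^4) by search (a 16-entry table in hardware)."""
    return next(y for y in range(1, 16) if gf16_mul(a, y) == 1)

# Pick lambda so that P(x) = x^2 + x + lambda has no root in GF(2^4),
# i.e. P(x) is irreducible over GF(2^4). This choice is an assumption.
LAMBDA = next(l for l in range(1, 16)
              if all(gf16_mul(x, x) ^ x ^ l for x in range(16)))

def comp_mul(a, b):
    """(a1 x + a0)(b1 x + b0) mod P(x), using x^2 = x + lambda."""
    a1, a0 = a
    b1, b0 = b
    hi = gf16_mul(a1, b1)
    return (hi ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1),
            gf16_mul(a0, b0) ^ gf16_mul(LAMBDA, hi))

def comp_inv(a):
    """Equation (10): both coefficients share one GF(2^4) inversion."""
    a1, a0 = a
    delta = (gf16_mul(a0, a0) ^ gf16_mul(a1, a0)
             ^ gf16_mul(LAMBDA, gf16_mul(a1, a1)))
    d_inv = gf16_inv(delta)
    return (gf16_mul(a1, d_inv), gf16_mul(a0 ^ a1, d_inv))
```

The only inversion left is in GF(2^4), so the look-up table shrinks from 256 entries to 16, which is the saving the composite field approach aims for.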

[19] gives details about the parameter λ and the polynomial P(x), as well as an
affine transform that maps elements of $GF(2^8)$ to elements of $GF(2^4)^2$. We
implemented the resulting composite field s-box as represented in Figure 4. We
inserted seven pipeline levels in order to get the ratio Nbr of LUTs / Nbr of
registers close to one. Note that this representation of the substitution box
leaves the rest of the design unchanged, as the Galois Field transform is used
twice in order to remain compatible with the other transforms. Table 2
summarizes the synthesis results for the composite non-linear transform γ where
the s-box is repeated 16 times. Compared to the multiplexor model, we have
traded LUTs for registers, and obtained a better efficiency.



Table 2. Synthesis of the composite non-linear layer γ.

Component   Nbr of LUT        Nbr of registers
γ           84 x 16 = 1344    76 x 16 = 1216
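The resource-efficiency criterion of section 4.1 (a ratio of LUTs to registers close to one) can be applied to the two synthesis results reported for γ. The short sketch below uses only the figures from the two tables; the function name is illustrative.

```python
def lut_register_ratio(luts, registers):
    """Resource efficiency indicator: Nbr of LUTs / Nbr of registers."""
    return luts / registers

# Multiplexor model of gamma: 144 x 16 LUTs, 42 x 16 registers.
mux_ratio = lut_register_ratio(144 * 16, 42 * 16)

# Composite field model of gamma: 84 x 16 LUTs, 76 x 16 registers.
composite_ratio = lut_register_ratio(84 * 16, 76 * 16)

print(f"multiplexor model: {mux_ratio:.2f}")
print(f"composite model:   {composite_ratio:.2f}")
```

The composite field version brings the ratio from about 3.4 down to about 1.1 while also using fewer LUTs, which is the sense in which LUTs were traded for registers.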



B. Implementing the Other Components: The Mixadd Combination

B1. The ShiftRows transform δ: This is just routing information and takes
no area in the design.

B2. The MixColumns transform θ: MixColumns operates on a 4-byte column
and corresponds to multiplications and additions in $GF(2^8)$. For example, for
the output byte $b_0$, we have:

$$b_0 = \texttt{'02'}\,a_0 + \texttt{'03'}\,a_1 + \texttt{'01'}\,a_2 + \texttt{'01'}\,a_3 \qquad (11)$$

Fig. 4. The composite substitution box.



We implemented multiplications with a function X that corresponds to the
multiplication with ’02’, modulo the irreducible polynomial
$m(x) = x^8 + x^4 + x^3 + x + 1$. Figure 5(a) illustrates the function X. Note that
output bits 0, 2, 5, 6, 7 just correspond to input bits shifted. Only 3 bits are
modified by an EXOR operation. From this, we can easily represent an output
byte of θ as shown in Figure 5(b):

$$b_0 = X(a_0) \oplus X(a_1) \oplus a_1 \oplus a_2 \oplus a_3 \qquad (12)$$

Fig. 5. (a) The function X. (b) Output byte b0 of MixColumns.

Interesting combinations between MixColumns and the key addition can be per- 
formed when observing the structure of the Virtex slice (see Figure 1). Indeed, 
we observe that a slice offers the possibility to perform an EXOR between 5 
bits: four bits are managed by the LUT and the last one by an EXOR gate next 
to the LUT. Our Mixadd transform takes advantage of this configuration and 
keeps the critical path inside one Virtex slice. 

B3. The Mixadd transform ε: In Figure 5(b), we observe that an output
byte of θ is obtained by a bitwise EXOR between 5 bytes: 3 are input bytes
and the remaining ones are output bytes of function X. However, looking at the
bit level, we know that 5 output bits of X are just shifted input bits. For these
ones, only one register is needed to pipeline the diffusion layer.

For the 3 remaining bits, there is an additional EXOR inside the func- 
tion X. Therefore, for these bits, we compute the bitwise EXOR between the 
3 left bytes of Figure 5(b) and the output bits of X independently. Then we 
insert a register. A bitwise EXOR operation remains to be carried out and we 
combine it with the key addition. The resulting Mixadd transformation only 
needs two register levels to keep a critical path inside one slice. 

Figure 6(a) illustrates the combination of MixColumns and Addroundkey 
at the bit level. Finally, Table 3 summarizes the synthesis results for the Mixadd 
transformation.



Table 3. Synthesis of the Mixadd transform ε.

Component    Nbr of LUT    Nbr of registers
ε            304           304



4.3 Implementation Schemes: A Second Tradeoff 

Depending on different optimization criteria, different architectures can be em- 
ployed. Optimization for maximum speed can be achieved by a fully pipelined 
unrolled architecture. In the applications requiring minimum area, a loop archi- 
tecture with only one round implemented seems to be the best choice. In both 
cases, we tried to maximize the efficiency defined in section 4.1. 

Our implementations of AES Rijndael result directly from the previous
component descriptions. For high throughput constraints, we implemented a
pipeline version that unrolls the 10 AES rounds, illustrated in Figure 6(b). For
low area constraints, we propose sequential implementations with only one
unrolled round. Figure 7(a) uses the optimized combination of MixColumns and
addkey, and its grey functions are actually included in the round ρ. Figure 7(b)
modifies the round structure so that the initial and final key additions are
managed inside the round. It is important to remark that the modification of
the round structure implies the loss of our mixadd combination and needs an
additional multiplexer for the last round of the algorithm. As a consequence
it presents no practical advantage in our FPGA implementations but would
probably be the best choice for ASICs where the CLB structure does not exist
and for which the mixadd optimization is therefore not relevant.

Fig. 6. (a) Mixadd transform at the bit level. (b) AES Rijndael: unrolled architecture.

Fig. 7. AES Rijndael: loop architecture (a) and (b).

For all our proposals, we evaluated the hardware cost in terms of LUTs, 
registers and slices as well as the frequency results. These results are estimated 
after implementation, using XILINX ISE5. 



4.4 Optimal Pipeline: A Third Tradeoff 

The design methodology and algorithmic optimization of previous sections
allowed us to reach very interesting frequencies after synthesis. However, the
implementation (and especially the routing task) of large designs was a critical
constraint in our designs. In practice, our most pipelined circuits presented
surprising delays made up of 20% logic and 80% routing. We concluded that the
real bottleneck of such large ciphers is the difficulty of achieving an efficient
place and route: for complex circuits, high pipelining is not mandatory.
Moreover, as the difficulty of the place and route task is hard to evaluate, a new
practical problem is to find the best tradeoff between good synthesis results and
good implementation results. We propose the heuristic of Algorithm 1 to solve
this last optimization problem. This heuristic led us to the optimized results of
the next section where we mention the optimal number of pipeline stages.

Algorithm 1 Optimal pipeline search

1. Start from the maximal pipeline defined in section 4.1, i.e. implement Rijndael with
the best ratio Nbr of LUTs / Nbr of registers;

2. After implementation, compute the efficiency Ecur = Throughput (Mbits/s) / Area
(slices);

3. OK = 0;

While OK = 0 do {

    1. Remove the pipeline stage that involves the lowest frequency reduction
    and re-implement Rijndael;

    2. After implementation, compute the efficiency Enxt = Throughput
    (Mbits/s) / Area (slices);

    3. If Ecur ≥ Enxt then OK = 1;

       else Ecur = Enxt;

}

4. The final efficiency Ecur specifies the optimal pipeline;
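
The loop of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: `implement` stands for a full synthesis plus place-and-route run and is replaced here by a hypothetical toy cost model (`toy_implement`, with invented stage names and numbers), and the stage "that involves the lowest frequency reduction" is found by trying each removal in turn.

```python
def efficiency(result):
    """Throughput (Mbits/s) divided by area (slices), as in steps 2 of Algorithm 1."""
    return result["throughput_mbits"] / result["slices"]


def optimal_pipeline(stages, implement):
    """Greedy search: drop pipeline stages while the efficiency keeps improving."""
    current = list(stages)
    e_cur = efficiency(implement(current))
    while current:
        # Candidate designs, each with one pipeline stage removed.
        trials = [[s for s in current if s != drop] for drop in current]
        # Keep the removal that costs the least frequency.
        nxt = max(trials, key=lambda t: implement(t)["freq_mhz"])
        e_nxt = efficiency(implement(nxt))
        if e_cur >= e_nxt:          # no further gain: stop
            break
        current, e_cur = nxt, e_nxt
    return current, e_cur


# Hypothetical toy model (NOT measured data): every stage costs 100 slices,
# and removing a stage lowers the clock frequency by the given penalty in MHz.
PENALTY_MHZ = {"a": 1, "b": 5, "c": 60}


def toy_implement(stages):
    freq = 200 - sum(p for name, p in PENALTY_MHZ.items() if name not in stages)
    return {
        "freq_mhz": freq,
        "throughput_mbits": 128 * freq,   # assumes one 128-bit block per cycle
        "slices": 1000 + 100 * len(stages),
    }


best_stages, best_eff = optimal_pipeline(["a", "b", "c"], toy_implement)
```

In the toy model, stages "a" and "b" are removed because they cost area while hiding little delay, and stage "c" is kept because removing it would lower the efficiency.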

5 Practical Results and Comparisons

In order to take every possible tradeoff into account in this section, we list our 
results for different architectures and different substitution boxes. The tables 
presented are based on the optimal pipeline defined in the previous section. Loop
architectures (Figure 7(a)) are in Table 4. Unrolled architectures (Figure 6(b)) 
are in Table 5. 



Table 4. Rijndael encryption: loop architectures on VIRTEX3200E.

Type          Nbr of  Nbr of  Nbr of  RAM     Latency   Output every  Freq. after  Throughput
              LUT     reg.    slices  blocks  (cycles)  (cycles)      impl. (MHz)  (Mbits/sec)
LUT-based γ   3846    2517    2257    0       52        5/52          169          2008
RAM-based γ   877     668     542     10      21        2/21          119          1450
Composite γ   2524    2185    1767    0       82        8/82          167          2085



Table 5. Rijndael encryption: unrolled architectures on VIRTEX3200E.

Type          Nbr of  Nbr of  Nbr of  RAM     Latency   Output every  Freq. after  Throughput
              LUT     reg.    slices  blocks  (cycles)  (cycles)      impl. (MHz)  (Mbits/sec)
LUT-based γ   33712   14592   19072   0       42        1             86           11008
RAM-based γ   3516    3840    2784    100     21        1             92           11776
Composite γ   19752   13479   15112   0       72        1             145          18560



Table 6. Comparisons with other implementations on VIRTEX-E technology.

Type                          Nbr of  Nbr of  RAM     Throughput  Throughput/Area
                              LUTs    slices  blocks  (Mbits/s)   (Mbits/s per slice or LUT)
McLoone et al. [9]            /       2222    100     6956        3.1
Our design                    3516    2784    100     11776       4.2
Helion tech. [10]             899     /       10      1187        1.32
Our design                    877     542     10      1450        1.65
Satoh et al. composite [15]   /       1880    0       589         0.31
Our design                    2524    1767    0       2085        1.17
Satoh et al. mux [15]         /       2529    0       833         0.33
Our design                    3846    2257    0       2008        0.88



Finally, in Table 6, we compare our results with the best implementations of
Rijndael encryption on VIRTEX-E technology found in the literature. For
RAM-based substitution boxes, McLoone and McCanny had the best unrolled
implementation in CHES 2001 [9] while Helion Technologies [10] had the best loop
architecture. For LUT-based substitution boxes, we have no knowledge of any
unrolled architecture but Satoh and Morioka presented in ASIACRYPT 2001 and
in the Third NESSIE workshop the best results for loop implementations [15,
20]. They studied mux-modeled s-boxes as well as composite ones. Finally,
concerning older technologies, we report in Table 7 an old result of our LUT-based
loop architecture and compare it with results of the last AES conference [3,4,5,
6]. The methodology applied allowed us to significantly improve on previously
reported performances of Rijndael implemented in FPGAs.

6 Conclusion 

When implementing block ciphers, several strategies can produce effective de- 
signs. Based on recently published works and observations about Rijndael, we 
studied different possible implementation tradeoffs. Inherent constraints of
FPGAs were taken into account in order to define an efficient methodology. We
defined notions of hardware efficiency and optimal pipeline and our circuits were 
designed in order to optimize different possible architectures: loop and unrolled. 
Inside these architectures, we proposed algorithmic optimizations for the sub- 
stitution box but also efficient combinations between the diffusion layer and the 
key addition. 

Upon comparison, our circuits offer better performance than previously
reported in the literature. Compact and high speed architectures are proposed and
implemented on VIRTEX-E technology. Throughput is up to 18.5 Gbits/sec 
and area requirements can be limited to 542 slices and 10 RAM blocks with an 
improved ratio throughput/area. Optimized efficiency was obtained by applying 
heuristic rules in order to deal with place and route constraints. 



Table 7. Comparisons with the last AES conference.

Type             Nbr of   Device       Throughput  Throughput/Area
                 slices                (Mbits/s)   (Mbits/s per slice)
Gaj et al.       2900     VIRTEX1000   331.5       0.11
Dandalis et al.  5673     VIRTEX1000   353         0.06
Elbirt et al.    9004     VIRTEX1000   1940        0.22
Our design       2257     VIRTEX1000   1563        0.69

Abstract. For most of the time since they were proposed, it was
widely believed that hyperelliptic curve cryptosystems (HECC) carry a 
substantial performance penalty compared to elliptic curve cryptosys- 
tems (ECC) and are, thus, not too attractive for practical applications. 
Only quite recently improvements have been made, mainly restricted 
to curves of genus 2. The work at hand advances the state-of-the-art 
considerably in several aspects. First, we generalize and improve the 
closed formulae for the group operation of genus 3 for HEC defined 
over fields of characteristic two. For certain curves we achieve over 50% 
complexity improvement compared to the best previously published 
results. Second, we introduce a new complexity metric for ECC and
HECC defined over characteristic two fields which allows performance
comparisons of practical relevance. It can be shown that the HECC
performance is in the range of the performance of an ECC; for specific 
parameters HECC can even possess a lower complexity than an ECC at 
the same security level. Third, we describe the first implementation of a 
HEC cryptosystem on an embedded (ARM7) processor. Since HEC are 
particularly attractive for constrained environments, such a case study 
should be of relevance. 

Keywords: Hyperelliptic curves, explicit formulae, comparison HECC 
vs. ECC, efficient implementation 



1 Introduction 

In 1976 Diffie and Hellman [DH76] revolutionized the field of cryptography by
introducing the concept of public-key cryptography. Their key exchange
protocol is based on the difficulty of solving the discrete logarithm (DL) problem
over a finite field. Years later, [Mil86,Kob87] introduced a variant of the
Diffie-Hellman key exchange, based on the difficulty of the DL problem in the group
of points of an elliptic curve (EC) over a finite field. Since their introduction,
elliptic curve cryptosystems (ECC) have been extensively studied not only by
the research community but also in industry. In particular, there are several
standards involving EC, such as the IEEE P1363 [P1399] standardization effort
and the bank industry standards [ANS99]. It is important to point out that ECC
benefit from shorter operand sizes when compared to RSA or DL based systems.
This fact makes ECC particularly well suited for small processors and memory
constrained environments.

In 1988 Koblitz suggested for the first time the generalization of EC to curves 
of higher genus, namely hyperelliptic curves (HEC) [Kob88]. In contrast to the 
EC case, it has only been until recently that Koblitz’s idea to use HEC for 
cryptographic applications, has been analyzed and implemented both in soft- 
ware [Kri97,SS98,SSI98,Eng99,SS00] and in more hardware-oriented platforms 
such as FPGAs [Wol01,BCLW02]. In 1999, [Sma99] concluded that there seems 
to be little practical benefit in using HEC, because of the difficulty of finding 
hyperelliptic curves and their relatively poor performance when compared to 
EC. However, quite recently the efficiency of the HEC group operation has been 
improved [Har00,MDM+02,Tak02,Lan02a]. It is well known that the best algo- 
rithm to compute the discrete logarithm in generic groups such as the Jacobian 
of a HEC is Pollard’s rho method or one of its parallel variants [Pol78,vOW99]. 
For curves of genus higher than four, [Gau00] showed that there exists an
algorithm with complexity $O(q^2)$ where $\mathbb{F}_q$ is the field over which the HEC is defined.
Thus, in this work, we only consider HEC of genus less than four, as curves of 
higher genus are potentially insecure from a cryptographic point of view. 

It is widely accepted that for most cryptographic applications based on EC
or HEC one needs a group order of size at least $\approx 2^{160}$. Thus, for HECC over
$\mathbb{F}_q$ we will need at least $g \cdot \log_2 q \approx 160$, where g is the genus of the curve. In
particular, for a curve of genus two, we will need a field $\mathbb{F}_q$ with $|\mathbb{F}_q| \approx 2^{80}$, i.e.,
80-bit long operands. Similarly, for curves of genus three, our discussion above
implies 54-bit long operands. These field sizes make HEC especially promising
for use in embedded environments where memory and speed are constrained,
and where the above operand sizes seem well suited to their small processor
architectures.
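
The operand sizes quoted above follow from dividing the target group-order size by the genus and rounding up. A small Python sketch of this arithmetic (the function name `operand_bits` and the default of 160 bits are illustrative choices taken from the paragraph above):

```python
def operand_bits(genus: int, group_order_bits: int = 160) -> int:
    """Smallest field size in bits such that genus * bits >= group_order_bits."""
    return -(-group_order_bits // genus)   # integer ceiling division


# Genus 1 corresponds to an elliptic curve; genus 2 and 3 are the HEC cases.
sizes = {g: operand_bits(g) for g in (1, 2, 3)}
```

For genus 3 the quotient 160/3 is not an integer, which is why the text rounds up to 54-bit operands.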

Our Main Contributions 

Genus 3 group operations: The work at hand presents for the first time gen- 
eralized explicit formulae for genus-3 curves including fields of characteristic 
2. For certain curves our group doubling formula saves more than 66% of 
the field multiplications compared to [KGM+02]. 

New complexity metric for HECC and ECC: We introduce a new metric 
for HECC and ECC over characteristic two fields which is based on an atomic 
operation count rather than on the (theoretical) bit complexity or specific 
timings. The most interesting results: (a) under certain conditions genus-3 
hyperelliptic curves are faster than ECC at the same level of security and 
(b) these HEC are faster than genus-2 curves. 

HECC implementation on an embedded platform: We support our the- 
oretical findings with a HECC implementation on an ARM7TDMI, which is 
one of the most popular embedded processors. Our implementation uses the 
best explicit formulae for genus-2 and genus-3 curves. 

The remainder of the paper is organized as follows. Section 2 summarizes
contributions dealing with previous implementations and comparisons of HECC
and ECC. Section 3 gives a brief overview of the mathematical background
related to HECC. Sections 4 and 5 present our new explicit formulae for genus-3
curves and a theoretical comparison between ECC and HECC. Section 6
introduces the implementation of HECC on embedded processors. Finally, we end
this contribution with a discussion of our results and some conclusions.

2 Previous Work 

In this section, we summarize previous improvements of the group operation of 
genus-2 and genus-3 curves, earlier theoretical comparisons between ECC and 
HECC, and other HECC implementations. 



Improvements to HECC Group Operations 

Table 1 summarizes the efforts made to date to speed up genus-2 curves. I refers
to inversion, M to multiplication, S to squaring, and M/S to multiplications
or squarings, since squarings are assumed to be of the same complexity as a
multiplication in these publications. For more details on previous improvements
made to the explicit formulae the interested reader is referred to [PWGP03].

For genus-3 hyperelliptic curves of odd characteristic the only improvement 
over Cantor’s algorithm was presented in [KGM+02]. The authors adopted the 
methods from [MDM+02,Har00] to obtain the speed-up. The operation com- 
plexity for genus-3 curves is summarized in Table 3. 



Theoretical Comparisons 

In [SSI98], the authors clarified practical advantages of hyperelliptic
cryptosystems when compared to ECC and to RSA. To our knowledge this is the first
and only contribution that investigates in detail the theoretical complexity of
ECC and HECC. They estimated the cost of different cryptosystems based on
the number of bit operations. In their work they used Cantor's formula and the
cost of one multiplication in $\mathbb{F}_{2^n}$ was assumed to take $n^2$ bit operations. One of
the estimated theoretical results shows that genus-3 curves needed three times
as many bit operations as elliptic curves. We want to point out that this
publication used supersingular curves^1 and curves of genus higher than 4 which today
are believed to be insecure due to the attacks presented in [FR94,Gau00,Gal01].

In the following years further analyses of the complexity of HECC were
published. A theoretical analysis of the computational efficiency of the arithmetic
on hyperelliptic curves is derived in [Eng99]. In [SS00], the authors implemented
hyperelliptic curve cryptosystems and analyzed the complexity of the group law
on Jacobians $J_C(\mathbb{F}_p)$ and $J_C(\mathbb{F}_{2^n})$. Moreover, they verified their theoretical
complexity estimates with a HECC implementation and with the theoretical analysis
done by Enge in [Eng99]. More recent papers present timings for HECC using
explicit formulae and compare HECC to ECC [Lan02a]. However, these
comparisons were based on the implementation timings.

^1 [Gal01] gives some arguments against using supersingular hyperelliptic curves in
cryptographic applications.



Table 1. Speeding up group operations on hyperelliptic curves of genus two.

                           field                                       cost
reference                  charac.   curve properties                  addition                 doubling
Cantor [Nag00]             general                                     3I + 70M/S               3I + 76M/S
Nagao [Nag00]              odd       h(x) = 0, f_i ∈ F_2               1I + 55M/S               1I + 55M/S
Harley [Har00]             odd       h(x) = 0                          2I + 27M/S               2I + 30M/S
Matsuo et al. [MCT01]      odd       h(x) = 0                          2I + 25M/S               2I + 27M/S
Miyamoto et al. [MDM+02]   odd       h(x) = 0, f_4 = 0                 I + 26M/S                I + 27M/S
Takahashi [Tak02]          odd       h(x) = 0                          I + 25M/S                I + 29M/S
Lange [Lan02a]             general   h_i ∈ F_2, f_4 = 0                I + 22M + 3S             I + 22M + 5S
                           two       h_i ∈ F_2, f_4 = 0                I + 22M + 2S             I + 20M + 4S
Lange [Lan02b]             general   h_i ∈ F_2, f_4 = 0                37M + 4S (40M + 3S)^2    40M + 6S
                           two       h_i ∈ F_2, f_4 = 0                46M + 2S                 33M + 6S
Lange [Lan02c]             odd       h_i ∈ F_2, f_4 = 0                47M + 7S (36M + 5S)^2    34M + 7S
                           even      h_2 ≠ 0, h_i ∈ F_2, f_4 = 0       46M + 4S (35M + 5S)^2    35M + 6S
                           even      h_2 = 0, h_i ∈ F_2, f_4 = 0       44M + 6S (34M + 6S)^2    29M + 6S

To our knowledge there is no theoretical complexity comparison between 
ECC and HECC published that uses the explicit formulae for HECC and com- 
pares HECC and ECC in terms of processor instructions, such as shift and XOR 
operations. Hence, this comparison is processor independent and can be adapted 
to any platform. 



HECC Implementations 

Since HEC cryptosystems were proposed, there have been several software im- 
plementations on general purpose machines and, only recently, publications deal- 
ing with hardware implementations of HECC. To our knowledge there has not 
been any work dealing with the implementation of HEC on embedded systems. 
The results of previous HECC software implementations are summarized in Ta- 
ble 2. Detailed information about previously made HECC implementations can 
be found in [PWGP03]. 

The first HECC hardware architectures were proposed in [Wol01]. The
performance of a hardware-based genus two hyperelliptic curve coprocessor over
$\mathbb{F}_{2^{113}}$ was presented in [BCLW02]. The FPGA was clocked at 45 MHz and
required 4750 clock cycles for a group addition and 4050 clock cycles for a group
doubling operation.



^2 mixed addition

Table 2. Execution times of recent HEC implementations in software.

reference   processor             genus   field                   t_scalarmult. in ms
[Kri97]     Pentium@100MHz        2       F_{2^64}                520
                                  3       F_{2^42}                1200
                                  4       F_{2^31}                1100
[SS98]      Alpha@467MHz          3       F_{2^59}                83.3
                                  3       F_{2^89}                25700
                                  3       F_{2^113}               37900
                                  4       F_{2^41}                96.6
            Pentium-II@300MHz     3       F_{2^59}                11700
                                  4       F_{2^41}                10900
[SS00]      Alpha21164A@600MHz    3       F_p (log_2 p = 60)      98
                                  3       F_{2^59}                40
                                  4       F_{2^41}                43
[MCT01]     PentiumIII@866MHz     2       186-bit OEF             1.98
[MDM+02]    PentiumIII@866MHz     2       186-bit OEF             1.69
[KGM+02]    Alpha21264@667MHz     3       F_{2^61 - 1}            0.932
[Lan02a]    Pentium-IV@1.5GHz     2       F_{2^160}               18.875
                                  2       F_{2^180}               25.215
                                  2       F_p (log_2 p = 160)     5.663
                                  2       F_p (log_2 p = 180)     8.162



3 Mathematical Background 

In this section we present an elementary introduction to some of the theory 
of hyperelliptic curves over finite fields of arbitrary characteristic, restricting 
attention to material that is relevant for this work. For more details the reader 
is referred to [Kob89,Kob98]. 

3.1 HECC and the Jacobian 

Let $\mathbb{F}$ be a finite field, and let $\overline{\mathbb{F}}$ be the algebraic closure of $\mathbb{F}$. A hyperelliptic
curve C of genus $g \geq 1$ over $\mathbb{F}$ is the set of solutions $(u,v) \in \mathbb{F} \times \mathbb{F}$ to the equation

$C : v^2 + h(u)v = f(u)$

Such a curve is said to be non-singular if there are no pairs $(u,v) \in \overline{\mathbb{F}} \times \overline{\mathbb{F}}$ which
simultaneously satisfy the equation of the curve C and the partial differential
equations $2v + h(u) = 0$ and $h'(u)v - f'(u) = 0$. The polynomial $h(u) \in \mathbb{F}[u]$ is
of degree at most g and $f(u) \in \mathbb{F}[u]$ is a monic polynomial of degree $2g + 1$. For
odd characteristic it suffices to let $h(u) = 0$ and to have $f(u)$ square free.

A divisor $D = \sum m_i P_i$, $m_i \in \mathbb{Z}$, is a finite formal sum of $\overline{\mathbb{F}}$-points. Its degree
is the sum of the coefficients $\sum m_i$. The set of all divisors forms an Abelian
group denoted by $\mathbb{D}(C)$. The set of divisors of degree zero will be denoted by
$\mathbb{D}^0 \subset \mathbb{D}(C)$.

Every rational function on the curve gives rise to a divisor of degree zero,
consisting of the formal sum of the poles and zeros of the function. Such
divisors are called principal and the set of all principal divisors is denoted by $\mathbb{P}$. If
$D_1, D_2 \in \mathbb{D}^0$ then we write $D_1 \sim D_2$ if $D_1 - D_2 \in \mathbb{P}$; $D_1$ and $D_2$ are said to
be equivalent divisors. Now, we can define the Jacobian of C as the quotient
group $\mathbb{D}^0/\mathbb{P}$. If we want to define the Jacobian over $\mathbb{F}$, denoted by $J_C(\mathbb{F})$, we say
that a divisor $D = \sum m_i P_i$ is defined over $\mathbb{F}$ (sometimes also called an $\mathbb{F}$-divisor
or rational divisor) if $D^{\sigma} = \sum m_i P_i^{\sigma}$ is equal to D for all automorphisms $\sigma$ of
$\overline{\mathbb{F}}$ over $\mathbb{F}$. Notice that this does not mean that each $P_i^{\sigma}$ is equal to $P_i$; $\sigma$ may
permute the points.

In [Can87], Cantor shows that each element of the Jacobian can be
represented in the form $D = \sum_{i=1}^{r} P_i - r \cdot \infty$ such that for all $i \neq j$, $P_i$ and $P_j$ are
not symmetric points. Such a divisor is called a semi-reduced divisor. Cantor
concludes that it follows from the Riemann-Roch Theorem that each element of
the Jacobian can be represented uniquely by such a divisor, subject to the
additional constraint $r \leq g$. Such divisors are referred to as reduced divisors. Finally,
[Can87] shows that the divisors of the Jacobian can be represented as a pair
of polynomials a(u) and b(u) with $\deg b(u) < \deg a(u) \leq g$, with a(u) dividing
$b(u)^2 + h(u)b(u) - f(u)$ and where the coefficients of a(u) and b(u) are elements of
$\mathbb{F}$ [Mum84] (notice that in our particular application $\mathbb{F}$ is a finite field). In the
remainder of this paper, a divisor D represented by polynomials will be denoted
by div(a, b).

3.2 Group Operations on a Jacobian 

This section gives a brief description of the algorithms used for adding and
doubling divisors on $J_C(\mathbb{F})$. These group operations will be performed in two
steps. First we have to find a semi-reduced divisor $D' = \mathrm{div}(a', b')$, such that
$D' \sim D_1 + D_2 = \mathrm{div}(a_1, b_1) + \mathrm{div}(a_2, b_2)$ in the group $J_C(\mathbb{F})$. In the second step
we have to reduce the semi-reduced divisor $D' = \mathrm{div}(a', b')$ to an equivalent
divisor $D = \mathrm{div}(a, b)$. Algorithm 1 describes the group addition.



Algorithm 1 Group addition
Require: $D_1 = \mathrm{div}(a_1, b_1)$, $D_2 = \mathrm{div}(a_2, b_2)$
Ensure: $D = \mathrm{div}(a, b) = D_1 + D_2$
1: $d = \gcd(a_1, a_2, b_1 + b_2 + h) = s_1 a_1 + s_2 a_2 + s_3 (b_1 + b_2 + h)$
2: $a'_0 = a_1 a_2 / d^2$
3: $b'_0 = [s_1 a_1 b_2 + s_2 a_2 b_1 + s_3 (b_1 b_2 + f)] d^{-1} \pmod{a'_0}$
4: while $\deg a'_k > g$ do
5:   $a'_k = (f - b'_{k-1} h - (b'_{k-1})^2) / a'_{k-1}$
6:   $b'_k = (-h - b'_{k-1}) \bmod a'_k$
7: end while
8: Output $(a = a'_k, b = b'_k)$
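
To make the steps of Algorithm 1 concrete, here is a minimal Python sketch. It is an illustration only and not the explicit formulae developed in this paper. The assumptions are ours: odd characteristic with h = 0, the toy genus-2 curve $v^2 = u^5 - u$ over $\mathbb{F}_7$ (far too small for cryptographic use), polynomials stored as coefficient lists from low to high degree, and all function names hypothetical.

```python
P = 7                       # toy prime field F_7 (assumption, not from the paper)
F = [0, 6, 0, 0, 0, 1]      # f(u) = u^5 - u, coefficients low to high
G = 2                       # genus of the toy curve


def trim(a):
    a = [c % P for c in a]
    while a and a[-1] == 0:
        a.pop()
    return a


def add(a, b):
    n = max(len(a), len(b))
    return trim([(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(n)])


def neg(a):
    return trim([-c for c in a])


def mul(a, b):
    if not a or not b:
        return []
    r = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            r[i + j] += x * y
    return trim(r)


def divmod_poly(a, b):
    """Quotient and remainder of a by the non-zero polynomial b."""
    a = trim(a)
    q = [0] * max(len(a) - len(b) + 1, 1)
    inv = pow(b[-1], -1, P)
    while len(a) >= len(b):
        c, k = a[-1] * inv % P, len(a) - len(b)
        q[k] = c
        a = add(a, neg(mul([0] * k + [c], b)))
    return trim(q), a


def xgcd(a, b):
    """Return (d, s, t) with s*a + t*b = d and d monic."""
    r0, r1, s0, s1, t0, t1 = trim(a), trim(b), [1], [], [], [1]
    while r1:
        q, r = divmod_poly(r0, r1)
        r0, r1 = r1, r
        s0, s1 = s1, add(s0, neg(mul(q, s1)))
        t0, t1 = t1, add(t0, neg(mul(q, t1)))
    if r0:
        c = [pow(r0[-1], -1, P)]
        r0, s0, t0 = mul(r0, c), mul(s0, c), mul(t0, c)
    return r0, s0, t0


def cantor_add(D1, D2):
    (a1, b1), (a2, b2) = D1, D2
    # Step 1: d = gcd(a1, a2, b1 + b2) = s1*a1 + s2*a2 + s3*(b1 + b2), with h = 0.
    d1, e1, e2 = xgcd(a1, a2)
    d, c1, s3 = xgcd(d1, add(b1, b2))
    s1, s2 = mul(c1, e1), mul(c1, e2)
    # Steps 2 and 3: composition.
    a = divmod_poly(mul(a1, a2), mul(d, d))[0]
    num = add(add(mul(mul(s1, a1), b2), mul(mul(s2, a2), b1)),
              mul(s3, add(mul(b1, b2), F)))
    b = divmod_poly(divmod_poly(num, d)[0], a)[1]
    # Steps 4 to 7: reduction until deg a <= g.
    while len(a) - 1 > G:
        a = divmod_poly(add(F, neg(mul(b, b))), a)[0]
        b = divmod_poly(neg(b), a)[1]
    return mul(a, [pow(a[-1], -1, P)]), b      # make a monic


def point_divisor(x0, y0):
    """Divisor (P) - (infinity) for a point P = (x0, y0) on the curve."""
    return trim([-x0, 1]), trim([y0])


D1, D2, D3 = point_divisor(2, 3), point_divisor(3, 3), point_divisor(0, 0)
left = cantor_add(cantor_add(D1, D2), D3)
right = cantor_add(D1, cantor_add(D2, D3))
```

Since reduced divisors are unique representatives, `left` and `right` should be the same pair of polynomials, and adding `point_divisor(2, 3)` to `point_divisor(2, 4)` should give the neutral element `([1], [])`.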



Doubling a divisor is easier than general addition and therefore Steps 1, 2, and
3 of Algorithm 1 can be simplified. The formulae given for the group operation
of HECC can be written explicitly as previously mentioned. In Section 4 we
develop explicit formulae of Cantor's Algorithm for genus-3 curves. For security
considerations of the HEC used see [PWGP03].

4 Speed-up for Genus-3 Curves

In this section we briefly outline the ideas of [GH00] and [KGM+02] which are
the starting point for our improvements. In [GH00], the authors noticed that
one can reduce the number of operations required to add/double divisors by
distinguishing between possible cases according to the properties of the input
divisors. This technique is combined with the use of the Karatsuba
multiplication algorithm [KO63] and the Chinese remainder theorem to further reduce the
complexity of the overall group operations. The work of [GH00] was generalized
by [KGM+02] to genus-3 curves defined over odd characteristic fields. In
particular, they notice that for genus-3 curves there are 6 possible choices for the
degree of the input polynomials to Algorithm 1 and that further classification
according to the common factors of the polynomials would lead to about 70
subcases. However, they only consider the most frequent cases^3 which occur with
overwhelming probability of $1 - O(1/q) \approx 1 - 2^{-60}$ for genus-3 curves over $\mathbb{F}_{2^{60}}$.
For the remaining cases, they use Cantor's algorithm.

In this work, we further optimize the formulae of [KGM+02] and generalize
them to arbitrary characteristic. Table 7 presents the explicit formulae for a
group addition and Table 8 those for a group doubling. The formulae shown in
the tables are based on the assumption that $h_i \in \{0, 1\}$, where $i = 0, 1, 2, 3$, and
that $f_6$ is equal to zero. The latter can be achieved by substituting $x' = x + f_6/7$.
The coefficient is still included in the algorithm for completeness.

Our improvements are based on the following techniques [PWGP03]: 

1. Montgomery's trick of simultaneous inversions [Coh93, Algorithm 10.3.4]

2. Reordering of normalization step [Tak02]

3. Karatsuba multiplication

4. Calculation of the resultant using Bezout's matrix

5. Choice of HEC
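
The first technique in the list, Montgomery's trick, replaces several field inversions by a single inversion plus a few multiplications. A minimal Python sketch follows; for simplicity it works in a prime field $\mathbb{F}_p$ (an assumption made for illustration, whereas the paper targets characteristic-two fields), and the function name `batch_inverse` is hypothetical.

```python
def batch_inverse(values, p):
    """Invert all non-zero values modulo the prime p with a single inversion."""
    # prefix[i] holds the product of values[0..i-1].
    prefix = [1]
    for v in values:
        prefix.append(prefix[-1] * v % p)
    inv = pow(prefix[-1], -1, p)          # the only inversion performed
    out = [0] * len(values)
    for i in range(len(values) - 1, -1, -1):
        out[i] = inv * prefix[i] % p      # inverse of values[i]
        inv = inv * values[i] % p         # now the inverse of prefix[i]
    return out


inverses = batch_inverse([3, 7, 10], 101)
```

For n elements the sketch spends one inversion and about 3(n - 1) multiplications, which pays off whenever an inversion costs more than three multiplications.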



Table 3. Comparing the complexity of the group operations on HEC of genus three.

                          field                                     cost
reference                 characteristic   curve properties        addition          doubling
Cantor [Nag00]            general          h(x) = 0, f_i ∈ F_2      4I + 200M/S       4I + 207M/S
Nagao [Nag00]             odd                                       2I + 154M/S       2I + 146M/S
Kuroki et al. [KGM+02]    odd              h(x) = 0, f_6 = 0        I + 81M/S         I + 74M/S
This work (Tables 7, 8)   general          h_i ∈ F_2, f_6 = 0       I + 70M + 6S      I + 61M + 10S
                          two              h_i ∈ F_2, f_6 = 0       I + 65M + 6S      I + 53M + 10S
                          two              h(x) = 1, f_6 = 0        I + 65M + 6S      I + 14M + 11S


^3 For addition the inputs are two co-prime polynomials of degree 3; for doubling the
input is a square free polynomial of degree 3.








As a summary we include the computational cost of all the published results 
for genus-3 curves in Table 3. Compared to [KGM+02], we save 5 multiplications 
in the addition algorithm and 3 multiplications in the doubling algorithm even 
though our formulae are more general. 



5 Comparing ECC and HECC 

In the past, providing complexity measures and, thus, comparisons between ECC 
and HECC was a difficult undertaking. The operations involved in both systems 
were very different (different field orders, field operations vs. operations with 
polynomials, etc.). Furthermore, measures such as the bit complexity often pro- 
vide very little information about the de facto complexity in actual implemen- 
tations. The underlying motivation for the work described in the following was 
the development of a more accurate metric for practical purposes. All operations 
which are computationally expensive will be expressed in terms of atomic opera- 
tions (AOPS), such as processor word-SHIFTs and XORs. In particular, we will 
decompose field multiplications into AOPS. This provides a metric which allows 
a comparison of fields of different sizes which is crucial for comparing ECC and 
HECC with equal level of security. The approach possesses the advantage that 
it accurately counts the actual elementary processor operations (as opposed to 
the more theoretical bit complexity), while at the same time avoiding processor 
and implementation-dependent “tricks” which can skew comparisons that are 
merely based on timings. In summary, we believe we developed a method which 
allows accurate predictions of the performance on a given processor without the 
laborious task of actually implementing the cryptosystem. The accuracy of the 
new metric is demonstrated by a mere 12% difference between our theoretical 
and practical results. 

The number of atomic operations is denoted as AOPS. In our comparison we 
make the following assumptions: 

1. We only consider fields of characteristic two and thus neglect the cost of 
squaring. 

2. We perform field multiplications with Algorithm 5 published^4 in [LD00]. This
algorithm requires $3 + 2(w/4 - 1)$ word-SHIFTs and $s(11 + n/4) + 8(2s - 1)$
word-XORs, where w is the word size of the processor and $s = \lceil n/w \rceil$ is the number
of words needed to represent an element of the underlying field $\mathbb{F}_{2^n}$.

3. We express the cost of one field inversion as m field multiplications and
denote the ratio of multiplications to inversions as MI-ratio.

Based on the assumptions stated above, the complexities of the group
operations of HEC and EC are summarized below. Referring to Tables 7 and 8, a divisor
addition for a genus-3 curve requires 1I + 65M and doubling needs 1I + 53M
(using a curve with h = 1, doubling needs only 1I + 14M). Assuming that the cost
of one field inversion is equivalent to m field multiplications leads to (65 + m)M
and (53 + m)M for addition and doubling, respectively. Due to the higher
extension of the underlying field used for genus-2 curves, a different MI-ratio l is
used. This leads to (22 + l)M for a divisor addition and (20 + l)M for a divisor
doubling. The number of inversions and multiplications for a group operation
on EC heavily depends on the chosen coordinate system. For completeness we
summarize the number of required operations in Table 4.

^4 To our knowledge this is the fastest published multiplication algorithm for finite
fields of characteristic two.



Table 4. Field operations required in each coordinate system [HHM00]

                                                  EC Addition
Coordinate system                                 general    mixed coord.   EC Doubling
Affine coordinates                                1I + 2M                   1I + 2M
Standard projective coordinates [CC87,CMO98]      13M        12M            7M
Jacobian projective coordinates [CC87,CMO98]      15M                       5M
New projective coordinates [LD99]                 14M        9M             4M


Figure 1 illustrates the number of operations for a scalar multiplication on
a 32-bit processor depending on the MI-ratios. The scalar multiplication with
an n-bit scalar is realized by the sliding window method with an approximated
cost of $n$ doublings $+\ 0.2 \cdot n$ additions for a 4-bit window size [BSS99]. Figure 1
allows one to estimate the efficiency of an ECC or a HECC built on top of a given
field library by comparing the different MI-ratios.
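
The cost model of this section can be written down in a few lines of Python. The sketch below is illustrative only: it counts field multiplications in each system's own field using the operation counts quoted above, with squarings neglected, and it does not convert them to AOPS, so counts for different field sizes are not directly comparable. The function names and the 160-bit scalar length are our assumptions; k, l and m denote the MI-ratios of the EC, genus-2 and genus-3 fields (the name k for the EC case is taken from the axis label of Figure 1).

```python
def scalar_mult_cost(n_bits, add_cost, dbl_cost):
    """Sliding window, 4-bit window: n doublings + 0.2 * n additions."""
    return n_bits * dbl_cost + 0.2 * n_bits * add_cost


def genus3_h1_cost(m, n_bits=160):
    """Genus-3 HEC with h = 1: addition I + 65M, doubling I + 14M."""
    return scalar_mult_cost(n_bits, 65 + m, 14 + m)


def genus3_cost(m, n_bits=160):
    """Genus-3 HEC, general h in characteristic two: I + 65M and I + 53M."""
    return scalar_mult_cost(n_bits, 65 + m, 53 + m)


def genus2_cost(l, n_bits=160):
    """Genus-2 HEC: addition I + 22M, doubling I + 20M."""
    return scalar_mult_cost(n_bits, 22 + l, 20 + l)


def ec_affine_cost(k, n_bits=160):
    """EC in affine coordinates: addition and doubling both cost I + 2M."""
    return scalar_mult_cost(n_bits, 2 + k, 2 + k)


example = genus3_h1_cost(m=8)   # field multiplications in the genus-3 field
```

To compare systems as in Figure 1, each count would still have to be multiplied by the AOPS cost of one multiplication in the corresponding field.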

In general we can draw the following conclusions from this comparison: 

1. ECC with projective coordinates is in almost all cases the most efficient 
cryptosystem. 

2. Scalar multiplication of genus-3 HEC with h(x) = 1 always outperforms 
genus-2 HEC. 

3. Genus-3 HECC scalar multiplication is in most cases faster than ECC using 
affine coordinates. 

4. For field libraries with very high MI-ratio, ECC using Jacobian projective
coordinates is more efficient than genus-3 HEC. However, for low MI-ratios
the HECC scalar multiplication becomes less expensive.

6 HECC on Embedded Systems 

With the predicted advent of ubiquitous computing, embedded processors will
play an increasingly important role in providing security functions. Due to
their relatively short operand lengths, HECC are particularly well suited for
embedded processors, which are typically computationally constrained. We chose
a representative of the popular ARM processor family for our implementation.
The purpose was twofold. First, we wanted to provide actual timings of a
highly optimized HEC implementation. Secondly, we wanted to validate our
complexity metric.

[Figure 1: plot; horizontal axis: MI-ratio (k, l, m)]

Fig. 1. Cost of a scalar multiplication for different MI-ratios and
cryptosystems in AOPS (32-bit µP, group order ≈ ...)

The ARM7TDMI@80MHz* processor environment was chosen to implement both
elliptic and hyperelliptic curve cryptosystems. For the elliptic curve case we
used curves over $\mathbb{F}_{2^{191}}$ and Jacobian projective coordinates.
The most efficient explicit formulae were implemented in the case of HECC. For
genus-3 curves the polynomial $h(x)$ equals one. The group orders range from
2^[...] to 2^[...]. Table 5 presents timings for divisor addition, divisor
doubling and scalar multiplication on the ARMulator. To our knowledge these
are the first published timings for HECC on an embedded processor.

To theoretically determine the most efficient cryptosystem based on the
timings given in Table 6, one can either use Figure 1 or calculate the
necessary number of AOPS. Considering a finite field $\mathbb{F}_{2^{63}}$ for
a genus-3 HEC, 619,402 AOPS are needed to calculate one scalar multiplication.
HECC of genus 2 with the underlying field $\mathbb{F}_{2^{95}}$ will take
1,049,028 AOPS, and ECC over $\mathbb{F}_{2^{191}}$ using Jacobian projective
coordinates requires 699,060 AOPS. Thus, we expect HECC of genus 2 to be a
factor of 1.5 slower and genus-3 HECC a factor of 1.1 faster than ECC. Genus-2
HECC is expected to be 1.5 times slower than genus-3 HECC.

* Depending on the features of the processor board, the performance numbers
can differ.



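Table 5. Timings of group operations with ARMulator ARM7TDMI@80MHz (explicit
formulae)

Genus   Field                     Group order   Group addition   Group doubling   Scalar mult.
                                                in µs            in µs            in ms*
3       $\mathbb{F}_{2^{54}}$     $2^{162}$     914              317              90
3       $\mathbb{F}_{2^{55}}$     $2^{165}$     917              319              91
3       $\mathbb{F}_{2^{59}}$     $2^{177}$     1180             415              126
3       $\mathbb{F}_{2^{60}}$     $2^{180}$     921              324              100
3       $\mathbb{F}_{2^{61}}$     $2^{183}$     1183             417              130
3       $\mathbb{F}_{2^{63}}$     $2^{189}$     925              329              106
2       $\mathbb{F}_{2^{81}}$     $2^{162}$     618              628              128
2       $\mathbb{F}_{2^{83}}$     $2^{166}$     732              756              157
2       $\mathbb{F}_{2^{88}}$     $2^{176}$     749              774              170
2       $\mathbb{F}_{2^{91}}$     $2^{182}$     754              778              177
2       $\mathbb{F}_{2^{95}}$     $2^{190}$     641              650              155
1       $\mathbb{F}_{2^{191}}$    $2^{191}$     598              358              100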



The timings for a scalar multiplication of genus-3 curves over
$\mathbb{F}_{2^{63}}$ and of genus-2 curves over $\mathbb{F}_{2^{95}}$ are
compared with the performance of the ECC scalar multiplication over
$\mathbb{F}_{2^{191}}$. HECC of genus 3 is a factor of 1.1 and HECC of genus
2 is a factor of 1.5 slower than ECC. Furthermore, a divisor scalar
multiplication on a HEC of genus 2 performs a factor of 1.5 worse than a
genus-3 HECC. The deviation of our implementation and the theoretical findings
is at most 12%. Thus, we can conclude that our theoretical estimates were
quite accurate.
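
The ratios quoted in this comparison can be recomputed directly from the reported numbers. The short sketch below only does that arithmetic: the timings are the scalar multiplication entries of Table 5 and the AOPS counts are those quoted in the text, while the dictionary keys and the helper `ratio` are hypothetical names introduced for the illustration.

```python
# Scalar multiplication timings in ms, as listed in Table 5.
measured_ms = {"genus3_F2^63": 106, "genus2_F2^95": 155, "ecc_F2^191": 100}

# Predicted cost in AOPS, as quoted in the text.
predicted_aops = {"genus3_F2^63": 619_402,
                  "genus2_F2^95": 1_049_028,
                  "ecc_F2^191": 699_060}


def ratio(table: dict, numerator: str, denominator: str) -> float:
    """Return how many times more expensive `numerator` is than `denominator`."""
    return table[numerator] / table[denominator]


if __name__ == "__main__":
    for label, table in (("measured", measured_ms), ("predicted", predicted_aops)):
        print(label,
              "genus-2 / ECC:", round(ratio(table, "genus2_F2^95", "ecc_F2^191"), 2),
              "genus-3 / ECC:", round(ratio(table, "genus3_F2^63", "ecc_F2^191"), 2))
```

A ratio above 1 means the hyperelliptic system is slower than ECC, a ratio below 1 means it is faster.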

Table 6. Timings of the field library and corresponding MI-ratios. All timings
in µs assuming an 80 MHz clock rate.

Field                     Multiplication   Inversion   MI-ratio
$\mathbb{F}_{2^{63}}$     11.5             73.7        6.4
$\mathbb{F}_{2^{95}}$     19.3             157.2       8.2
$\mathbb{F}_{2^{191}}$    50.7             469.9       9.3



A further speed-up can be achieved by the use of special reduction routines
targeting a fixed irreducible polynomial.

7 Conclusions

In this contribution, we were able to close the gap between the performance
of HECC and ECC. In particular, an improvement of the explicit formulae for
arbitrary characteristic for the case of genus-3 hyperelliptic curves was
presented. For certain curves over fields of characteristic 2, the efficiency
of the doubling algorithm could be enhanced drastically. This increased the
performance of a scalar multiplication by over 50% compared to [KGM+02].

A theoretical comparison of ECC to HECC with coefficients in
$\mathbb{F}_{2^m}$, assuming the currently fastest algorithms for field
operations, was also presented. An important finding is that HECC can reach
about the same throughput as ECC and that genus-3 HECC with $h(x) = 1$ are
always faster than genus-2 HECC. However, the properties of the field
libraries are the key to determining the overall performance of ECC and HECC.

The theoretical results are confirmed by the first implementation of genus-2
and genus-3 curves on an embedded processor.

A Explicit Formulae for Genus Three HEC

The explicit formulae for the group operations on HEC of genus 3 and arbitrary
characteristic, as well as the most efficient formulae for doubling on a HEC
with $h(x) = 1$ for characteristic two, are presented in Tables 7 and 8.

Table 7. Explicit formulae for adding on a HEC of genus three

Input:   Weight three reduced divisors $D_1 = (u_1, v_1)$ and $D_2 = (u_2, v_2)$;
         $h = h_3x^3 + h_2x^2 + h_1x + h_0$, where $h_i \in \mathbb{F}_2$;
         $f = x^7 + \ldots + f_0$
Output:  A weight three reduced divisor $D_3 = (u_3, v_3) = D_1 + D_2$

Step    Procedure                                                  Cost
1       Resultant $r$ of $u_1$ and $u_2$ (Bezout)                  12M + 2S
2       Almost inverse $inv = r/u_1 \bmod u_2$                     4M
3       $s' = rs \equiv (v_2 - v_1)\,inv \bmod u_2$ (Karatsuba)    11M
4       $s = (s'/r)$ and make $s$ monic                            I + 6M + 2S
5       $z = s u_1$                                                6M
6       $u' = [s(z + w_4(h + 2v_1)) - \ldots$                      15M
7       $v' = -(w_3 z + h + v_1) \bmod u'$                         8M
8       Reduce $u'$, i.e. $u_3 = (f - v'h - v'^2)/u'$              5M + 2S
9       $v_3 = -(v' + h) \bmod u_3$                                3M
Total   in fields of arbitrary characteristic                      I + 70M + 6S
        in fields of characteristic 2                              I + 65M + 6S



Table 8. Explicit formulae for doubling on HEC of genus three

Input:   A weight three reduced divisor $D_1 = (u_1, v_1)$;
         $h = h_3x^3 + h_2x^2 + h_1x + h_0$, where $h_i \in \mathbb{F}_2$;
         $f = x^7 + \ldots + f_0$
Output:  A weight three reduced divisor $D_2 = (u_2, v_2) = [2]D_1$

Step    Procedure                                                          Cost
1       Resultant $r$ of $u_1$ and $h + 2v_1$ (Bezout)                     6M + 2S
2       Almost inverse $inv = r/(h + 2v_1) \bmod u_1$                      4M
3       $z = ((f - hv_1 - v_1^2)/u_1) \bmod u_1$                           7M + 2S
4       $s' = z\,inv \bmod u_1$ (Karatsuba)                                11M
5       $s = (s'/r)$ and make $s$ monic                                    I + 6M + 2S
6       $G = s u_1$                                                        6M
7       $u' = u_1^{-2}[(G + w_4v_1)^2 + w_4hG + w_5(hv_1 - f)]$            5M + 2S
8       $v' = -(Gw_3 + h + v_1) \bmod u'$                                  8M
9       Reduce $u'$, i.e. $u_2 = (f - v'h - v'^2)/u'$                      5M + 2S
10      $v_2 = -(v' + h) \bmod u_2$                                        3M
Total   in fields of arbitrary characteristic                              I + 61M + 10S
        in fields of characteristic 2                                      I + 53M + 10S

Doubling in fields of characteristic 2 with $h(x) = 1$:

Step    Procedure                                                          Cost
I       $d = \gcd(u_1^2, 1) = 1 = s_1 a + s_3 h$ ($s_3 = 1$, $s_1 = 0$)    -
II      $u' = u_1^2$                                                       3S
III     $v' = v_1^2 + f \bmod u'$                                          3S
IV      $u'' = ((f - hv' - v'^2)/u')$                                      3M + 3S
V       $u_2 = u''$ made monic                                             1I + 2M
VI      $v_2 = -(v' + h) \bmod u_2$ (Karatsuba)                            5M
VII     $u_3 := (f - v_2 h - v_2^2)/u_2$                                   1M + 2S
VIII    $v_3 = -(v_2 + h) \bmod u_3$                                       3M
Total   in fields of characteristic 2 and with $h(x) = 1$                  I + 14M + 11S





Countermeasures against Differential Power 
Analysis for Hyperelliptic Curve Cryptosystems 



Roberto M. Avanzi* 

Institut für Experimentelle Mathematik
University of Duisburg-Essen (Essen site)
Ellernstraße 29 - 45326 Essen, Germany
mocenigo@exp-math.uni-essen.de



Abstract. In this paper we describe some countermeasures against
differential side-channel attacks on hyperelliptic curve cryptosystems. The
techniques are modelled on the corresponding ones for elliptic curves.
The first method consists in picking a random group isomorphic to the
one where we are supposed to compute, transferring the computation to
the random group and then pulling the result back. The second method
consists in altering the internal representation of the divisors on the
curve in a random way. The impact of the recent attack of L. Goubin is
assessed and ways to avoid it are proposed.

Keywords. Public-key cryptography, Side-channel attacks, Differential
power analysis (DPA), Timing attacks, Hyperelliptic curves, Smart cards.



1 Introduction 

The use of Jacobian varieties of hyperelliptic curves in discrete logarithm cryp- 
tosystems was proposed by N. Koblitz as early as 1988 [17,18] as an alternative 
to elliptic curves. Hyperelliptic curves are a generalisation of elliptic curves: the 
latter are just the hyperelliptic curves of genus one. 

Until very recently, however, elliptic curve cryptosystems (short: ecc) have 
been perceived as faster than hyperelliptic systems (short: hecc, but some other 
authors prefer abbreviations like hec or hcc) of genus at least two and offering 
comparable security. An important milestone in the road to change this per- 
ception happened in September 2002: at the ECC 2002 Workshop in Essen, 
K. Nguyen of Philips Research reported on his implementation on a hardware 
simulator of T. Lange’s projective formulae for genus 2 [25]. This showed for 
the first time that the performance of hecc can be competitive, even for smart 
card applications. Shortly afterwards J. Pelzl, T. Wollinger, J. Guajardo and 
C. Paar [38] obtained efficient formulae for genus 3 hyperelliptic Jacobians in all 
characteristics improving on the work of [23]. 

* The work described in this paper has been supported by the Commission of the
European Communities through the IST Programme under Contract IST-2001-32613
(see http://www.arehcc.com or http://www.arehcc.org).

This raises immediately the issue of the security of hecc against side-channel
attacks, first introduced in the form of timing attacks in [20] and then simple and
differential power analysis (SPA and DPA) [21,22]. These attacks measure some
leaked information of a cryptographic device (e.g. timing, power consumption,
electromagnetic radiation) while it processes its inputs. For historical reasons
we just write DPA also when exploiting leaked information other than power
consumption. If a single input is used, the process is referred to as a Simple Power
Analysis (SPA), and if several different inputs are used together with statistical
tools, it is called Differential Power Analysis (DPA). We are concerned here with
the second type of analysis.

SPA attempts to recover the secret scalar from one observation of the
sequence of operations: for example, in a simple double-and-add algorithm the
number of consecutive group doublings minus one is the number of zeros
between two ones in the binary representation of the scalar. For ecc there
exist two anti-SPA strategies.

The first strategy aims at making the sequence of group operations seem- 
ingly independent from the scalar. In the “double-and-add-always” [7] scalar 
multiplication method an addition is performed after each doubling, even if the 
corresponding digit of the scalar is zero: This can be done of course in any group, 
including the Jacobians of hyperelliptic curves. For curves in odd characteristic 
admitting a particular form, the “Montgomery” method [33,37] allows a very fast 
computation where the y-coordinate is not used. Analogues of this idea exist
for binary curves [1,30] and for all elliptic curves over prime fields [3,8].
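
The "double-and-add-always" idea can be sketched generically, since it only needs a group addition and a doubling. The Python fragment below is a minimal illustration, not the construction of [7] verbatim: the function name and its parameters are hypothetical, the group is passed in as plain callables, and the final selection uses an ordinary Python conditional, which is not itself side-channel safe. It only shows that the recorded sequence of group operations no longer depends on the bits of the scalar.

```python
def double_and_add_always(n, base, add, double, identity):
    """Left-to-right binary scalar multiplication with a dummy addition.

    `add`, `double` and `identity` describe an abstract group (assumed to be
    supplied by the caller). Returns the result together with the trace of
    group operations that were performed.
    """
    result = identity
    trace = []
    for bit in bin(n)[2:]:
        result = double(result)
        trace.append("D")
        candidate = add(result, base)  # performed for every bit
        trace.append("A")
        # Keep the sum only when the bit is 1; otherwise it was a dummy.
        result = candidate if bit == "1" else result
    return result, trace


if __name__ == "__main__":
    m = 1009  # toy group: integers modulo m under addition
    res, trace = double_and_add_always(
        45, 7, lambda a, b: (a + b) % m, lambda a: (2 * a) % m, 0)
    print(res, "".join(trace))
```

For two scalars of the same bit length the trace is identical, at the price of one extra group addition for every zero bit.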

The second strategy relies on indistinguishable addition and doubling formu- 
lae. They exist for many classes of curves, such as those in Hessian [15,40] or in 
Jacobi Form [28]. E. Brier and M. Joye found such formulae for elliptic curves 
over all fields [3]. Another way of pursuing this strategy is to insert dummy 
operations: for an even characteristic example see [2]. 

At the moment of this writing little has been done to protect specifically 
a hecc against SPA. The only currently known methods are the generic ones 
such as: (i) the insertion of dummy group additions in the scalar multiplication 
algorithm (as in the “double-and-add-always” method) or (ii) the insertion of 
dummy field operations in the addition and doubling formulae. T. Lange [27] 
remarked that the latter can be realized easily and efficiently with the genus 
2 affine formulae: this is particularly important for the applications, since the 
formulae are simpler than in the genus 3 and 4 cases, and the security of genus 
2 curves is better understood. 

Henceforth we shall always assume that the scalar multiplication algorithm 
has been made immune from SPA by at least one of these two techniques. 

In a DPA the side-channel information collected upon processing of several
different inputs is correlated with the value of a boolean function $\chi$ of
the internal representation of the operands in the cryptographic hardware. The
attacker, who is assumed to know the algorithm, guesses that the hardware will
perform a specific operation at a given point (for example which operand from
a table is reused, or which branch is taken) depending on some part of the
secret information to be elicited. The inputs are then sorted in two sets
according to the values of $\chi$ on the corresponding guessed outputs. If the
statistical correlation with the leaked information is good, the guess was
correct. This leads to attacks which require time linear in the length of the
cryptographic operation. We refer the reader to [20,21,22] for more details.
Short descriptions can also be found in [7, §3] and [16, §§3.2 and 3.3].

The present work is a first attempt to harden hecc against DPA. In the next 
section we develop hyperelliptic curve analogues of Coron’s third countermeasure 
[7] (point randomisation) and of the curve randomisation method of M. Joye and 
C. Tymen [16]. The impact of the recent results of L. Goubin [13] is discussed. 
We also discuss the applicability of such techniques in light of: (i) the state of the 
art of explicit formulae for divisor addition and (ii) security results for specific 
classes of varieties. An appendix contains an example of explicit transformations 
for the curve randomisations in genus 2. 

2 The Techniques 

2.1 Curve Randomisation 

An excellent, low brow introduction to the subject of hyperelliptic curves, with 
a detailed derivation of the facts used below, is given in [31]: Our notation is 
slightly different, but conforms to that of [24,25,26,27,38]. 

The idea behind curve randomisation techniques is to “scramble” all the bits 
of the computation in a (hopefully) unpredictable way. It consists in picking 
a random group isomorphic to the one on which the cryptosystem is based, 
transferring the computation to it and then pulling the result back. 

More formally, let $C$ and $\tilde{C}$ be two hyperelliptic curves of genus
$g \geq 1$ over a finite field $\mathbb{F}_q$. Suppose that
$\phi : C \to \tilde{C}$ is an $\mathbb{F}_q$-isomorphism which is easily
extended to an $\mathbb{F}_q$-isomorphism of the Jacobians
$\phi : J(C) \to J(\tilde{C})$. Let us further assume that $\phi$, together
with its inverse, is computable in a reasonable amount of time, i.e. small
with respect to the time of a scalar multiplication. We do not require a
priori the computation time of $\phi$ to be negligible with respect to a
single group operation. Then instead of computing $Q = nD$ in
$J(C)(\mathbb{F}_q)$, where $n$ is an integer and $D \in J(C)(\mathbb{F}_q)$,
we perform:

$$Q = \phi^{-1}(n\,\phi(D)) \tag{1}$$

so that the bulk of the computation is done in $J(\tilde{C})(\mathbb{F}_q)$,
or, since a picture is worth a thousand words, we note that the following
diagram commutes

$$\begin{array}{ccc}
J(C)(\mathbb{F}_q) & \xrightarrow{\ \text{multiplication by } n\ } & J(C)(\mathbb{F}_q) \\
\phi \downarrow & & \uparrow \phi^{-1} \\
J(\tilde{C})(\mathbb{F}_q) & \xrightarrow{\ \text{multiplication by } n\ } & J(\tilde{C})(\mathbb{F}_q)
\end{array}$$

and we follow it along the longer path.

The countermeasure is effective if the representations of the images under
$\phi$ of the curve coefficients and of the elements of $J(C)(\mathbb{F}_q)$
are unpredictably different from those of their sources. This can be achieved
by multiplying the quantities involved in a computation with randomly chosen
numbers (but: see Subsection 2.3). We are going to show that, in the case of
hyperelliptic curves, this can be done by a small number of elementary field
operations.

We do not discuss the use of random field isomorphisms according to [16, 
§4.2]. The treatment carries over with little or no changes, and the method is 
computationally heavy, considerably slowing down all field operations. It is not 
clear whether it can even be done on a smart card in the ecc case. In hecc, 
the ground field being smaller, it is possible that this countermeasure could 
be implemented. As there is a potential performance/security trade-off in even 
characteristic with curve randomisations (see §2.1), especially in the genus 2 
case, one might be tempted to reconsider the use of field isomorphisms: However, 
divisor randomisation (see §2.2) makes them superfluous. 

General curve isomorphisms. We now put in practice the idea just sketched.
Let $g \geq 1$ be an integer, and $\mathbb{F}_q$ be a finite field. Let
$C, \tilde{C}$ be two hyperelliptic curves of genus $g$ defined by Weierstrass
equations

$$C : y^2 + h(x)y - f(x) = 0 \tag{2}$$

$$\tilde{C} : y^2 + \tilde{h}(x)y - \tilde{f}(x) = 0 \tag{3}$$

over $\mathbb{F}_q$, where $f, \tilde{f}$ are monic polynomials of degree
$2g + 1$ in $x$ and $h(x), \tilde{h}(x)$ are polynomials in $x$ of degree at
most $g$. $C$ (and $\tilde{C}$) has no singular affine points, i.e. there are
no solutions $(x, y) \in \overline{\mathbb{F}}_q \times \overline{\mathbb{F}}_q$
which simultaneously satisfy the equation $y^2 + h(x)y - f(x) = 0$ and the
partial derivative equations $2y + h(x) = 0$ and $h'(x)y - f'(x) = 0$. This is
equivalent to saying that the discriminant of $4f + h^2$ does not vanish
[29, Theorem 1.7]. Analogous conditions hold for $\tilde{C}$. Denote by
$\infty$ the non-affine point in the projective completions of $C$ and
$\tilde{C}$. All $\mathbb{F}_q$-isomorphisms of curves
$\phi : C \to \tilde{C}$ are, by [29, Proposition 1.2], of the type

$$\phi : (x, y) \mapsto (s^{-2}x + b,\ s^{-(2g+1)}y + A(x)) \tag{4}$$

for some $s \in \mathbb{F}_q^\times$, $b \in \mathbb{F}_q$ and a polynomial
$A(x) \in \mathbb{F}_q[x]$ of degree at most $g$. Upon substituting
$s^{-2}x + b$ and $s^{-(2g+1)}y + A(x)$ in place of $x$ and $y$ in equation
(3) and comparing with (2) we obtain

$$\begin{cases}
h(x) = s^{2g+1}\big(\tilde{h}(s^{-2}x + b) + 2A(x)\big) \\
f(x) = s^{2(2g+1)}\big(\tilde{f}(s^{-2}x + b) - A(x)^2 - \tilde{h}(s^{-2}x + b)A(x)\big)
\end{cases} \tag{5}$$

whose inversion is

$$\begin{cases}
\tilde{h}(x) = s^{-(2g+1)}h(\hat{x}) - 2A(\hat{x}) \\
\tilde{f}(x) = s^{-2(2g+1)}f(\hat{x}) + s^{-(2g+1)}h(\hat{x})A(\hat{x}) - A(\hat{x})^2
\end{cases}
\quad\text{where } \hat{x} = s^2(x - b). \tag{6}$$

Now $\phi$ is an isomorphism of $C$ onto $\tilde{C}$ and it induces an
isomorphism (which we also call $\phi$) of their Jacobians, which is also a
group isomorphism. It is a well known fact that the Jacobian of a curve $C$ is
isomorphic to the ideal class group $\mathrm{Cl}^0(C)$, which is more suitable
for direct computations, and for this reason we want to see how $\phi$
operates on the elements of $\mathrm{Cl}^0(C)$.

D. Mumford [35] has introduced a representation of the elements of the latter
group as polynomial pairs, for which D. Cantor [4] provided an explicit
arithmetic algorithm. Any divisor can be written as
$D = \sum_{P \in S} m_P P - \big(\sum_{P \in S} m_P\big)\infty$ for a finite
subset of points $S$ of $C(\overline{\mathbb{F}}_q)$ called the support of
$D$, the $m_P$ being positive integers, and the degree of $D$ is the integer
$\deg(D) = \sum_{P \in S} m_P$. There is a unique reduced divisor of degree at
most $g$ in a given divisor class on $C$. Then (the ideal class associated to)
$D$ is represented by a unique pair of polynomials
$U(t), V(t) \in \mathbb{F}_q[t]$ with $g \geq \deg_t U > \deg_t V$, $U$ monic
and such that:

$$U(t) = \prod_{P \in S} (t - x_P)^{m_P}, \qquad
V(x_P) = y_P \ \text{for all } P \in S, \qquad
U(t) \ \text{divides}\ V(t)^2 + V(t)h(t) - f(t).$$

We say that the pair $[U(t), V(t)]$ represents the reduced divisor $D$. It is
$\deg(D) = \deg(U)$.

It is clear that we want to find a pair of polynomials
$\tilde{U}(t), \tilde{V}(t) \in \mathbb{F}_q[t]$ which satisfy similar
conditions, but for the divisor
$\phi(D) = \sum_{P \in S} m_P\,\phi(P) - \big(\sum_{P \in S} m_P\big)\infty$
in place of $D$. In other words, we must have:

$$D = \sum_{P \in S} m_P P - \Big(\sum_{P \in S} m_P\Big)\infty
\ \longmapsto\
\sum_{P \in S} m_P\,\phi(P) - \Big(\sum_{P \in S} m_P\Big)\infty = \phi(D).$$

This is very straightforward to obtain. Clearly

$$\tilde{U}(t) = \prod_{P \in S} (t - x_{\phi(P)})^{m_P}
= \prod_{P \in S} (t - s^{-2}x_P - b)^{m_P}
= s^{-2\deg_t U}\,U\big(s^2(t - b)\big). \tag{8}$$

Then, $\tilde{V}$ must satisfy $\tilde{V}(x_{\phi(P)}) = y_{\phi(P)}$ for all
$P \in S$, in other words

$$\tilde{V}(s^{-2}x_P + b) = s^{-(2g+1)}y_P + A(x_P) = s^{-(2g+1)}V(x_P) + A(x_P),$$

so that

$$\tilde{V}(t) = s^{-(2g+1)}\,V\big(s^2(t - b)\big) + A\big(s^2(t - b)\big). \tag{9}$$

Equations (8) and (9) give the correct $\tilde{U}(t)$ and $\tilde{V}(t)$. This
follows from the uniqueness of the representation of reduced divisors: in fact
$\tilde{U}(t)$ and $\tilde{V}(t)$ are defined over $\mathbb{F}_q$,
$\deg \tilde{V} = \deg V < \deg U = \deg \tilde{U}$, and it is straightforward
to verify that $\tilde{U}(t)$ divides
$\tilde{V}(t)^2 + \tilde{V}(t)\tilde{h}(t) - \tilde{f}(t)$.



Odd characteristic. Here we consider the case where $\mathbb{F}_q$ is a finite
field of odd characteristic. We assume that $h(x) = \tilde{h}(x) = 0$, since
we can transform the equations by the variable changes
$y \mapsto y - h(x)/2$ and $y \mapsto y - \tilde{h}(x)/2$. The advantage in
doing so is that Cantor's algorithm will run faster, and for the same reason
explicit formulae for odd characteristic have only been developed under this
assumption. Then the equations of $C, \tilde{C}$ are of the form

$$C : y^2 - f(x) = 0 \tag{10}$$

$$\tilde{C} : y^2 - \tilde{f}(x) = 0 \tag{11}$$

which imply, by (6), that $A(x) = 0$.

If, furthermore, $\mathrm{char}\,\mathbb{F}_q \nmid 2g + 1$, we can assume
that the second most significant coefficient of $f(x)$ (and of
$\tilde{f}(x)$), i.e. the coefficient $f_{2g}$ of $x^{2g}$, vanishes too,
since we can perform the variable change $x \mapsto x - f_{2g}/(2g + 1)$. In
this case, moreover, by (6) it must be $b = 0$, so the isomorphism $\phi$
takes the simple form

$$\phi : (x, y) \mapsto (s^{-2}x,\ s^{-(2g+1)}y) \tag{12}$$

where $s \in \mathbb{F}_q^\times$. (For simplicity, we shall consider only
isomorphisms of this kind, even if
$\mathrm{char}\,\mathbb{F}_q \mid 2g + 1$.) The formula for $\tilde{f}$ is

$$\tilde{f}(x) = s^{-2(2g+1)} f(s^2 x).$$

This randomisation modifies all non-zero coefficients of the Weierstrass
equation (that is, all those which are used in the computation) and of the two
polynomials representing a reduced divisor (except for the leading coefficient
of $U$, which must remain equal to 1), namely

$$\tilde{U}(t) = s^{-2\deg U}\,U(s^2 t), \qquad \tilde{V}(t) = s^{-(2g+1)}\,V(s^2 t).$$

Explicit description, an implementation trick. The method is very fast. First,
we pick a random $s \in \mathbb{F}_q^\times$ and compute its multiplicative
inverse. They are both needed: $s^{-1}$ for $\phi$ and $s$ for $\phi^{-1}$. We
make the computation of $\phi$ explicit. If

$$f(x) = x^{2g+1} + \sum_{i=0}^{2g-1} f_i x^i$$

then

$$\tilde{f}(x) = x^{2g+1} + \sum_{i=0}^{2g-1} s^{2i - 2(2g+1)} f_i x^i .$$

For $U(t)$ and $V(t)$ in the general case it is

$$U(t) = t^g + \sum_{i=0}^{g-1} U_i t^i \quad\text{and}\quad V(t) = \sum_{i=0}^{g-1} V_i t^i$$

so that

$$\tilde{U}(t) = t^g + \sum_{i=0}^{g-1} s^{2i - 2g}\,U_i t^i \quad\text{and}\quad
\tilde{V}(t) = \sum_{i=0}^{g-1} s^{2i - (2g+1)}\,V_i t^i .$$

To apply $\phi$ to the equation of the curve and to the base divisor
$[U(t), V(t)]$ we proceed as follows: Assume we have already $s$ and $s^{-1}$.
We compute $s^{-k}$ for $k = 2, 3, \ldots, 2g + 1$ and
$k = 2(g + 1), 2(g + 2), \ldots, 2(2g + 1)$. This requires $3g + 1$
multiplications (some can be replaced with squarings). For even $k$ we compute
$\tilde{f}_{2g+1-k/2} = s^{-k} f_{2g+1-k/2}$ (if $k \neq 2$) and
$\tilde{U}_{g-k/2} = s^{-k} U_{g-k/2}$ (with $k \leq 2g$). If $k$ is odd and
$\leq 2g + 1$ we multiply $V_{g-(k-1)/2}$ by $s^{-k}$ to obtain
$\tilde{V}_{g-(k-1)/2}$. Computing $\tilde{f}$, $\tilde{U}$ and $\tilde{V}$
requires $4g$ multiplications, hence the total amount of operations required
to apply $\phi$ is $7g + 1$ multiplications. Computing $\phi^{-1}$ requires
only $4g$ multiplications in $\mathbb{F}_q$, bringing the total to $11g + 1$.

In the cases $g = 2$, resp. 3, this randomisation needs 23, resp. 34 field
multiplications (and possibly one inversion), which compares favorably to the
costs of one group addition: in the genus 2 case, according to T. Lange [24]
one group addition requires 25 multiplications and 1 inversion, and in the
genus 3 case J. Pelzl et al. [38] need 76 multiplications and 1 inversion.
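
The coefficient-wise description of the map in odd characteristic lends itself to a small numerical check. The sketch below is an illustration under loudly stated assumptions: it works over the toy prime field with p = 1019, takes g = 2, and builds f as V^2 + U*W from arbitrarily chosen U, V, W so that U divides V^2 - f by construction (non-singularity of the curve is not checked). All helper names are hypothetical. It scales the coefficient of index i by s^(2i - 2(2g+1)), s^(2i - 2g) and s^(2i - (2g+1)) for f, U and V respectively, and verifies that the divisibility condition survives the transformation.

```python
P = 1019  # toy prime, far too small for cryptographic use
G = 2     # genus


def poly_add(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return [(x + y) % P for x, y in zip(a, b)]


def poly_sub(a, b):
    return poly_add(a, [(-c) % P for c in b])


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % P
    return out


def poly_rem(a, m):
    """Remainder of a modulo the monic polynomial m (lists are low-to-high)."""
    a = a[:]
    while len(a) >= len(m):
        c, shift = a[-1], len(a) - len(m)
        for i, mi in enumerate(m):
            a[shift + i] = (a[shift + i] - c * mi) % P
        a.pop()  # the leading coefficient is now zero
    return a


def scale(coeffs, t, top):
    """Multiply coefficient i by t^(top - 2i); with t = s^-1 this is s^(2i - top)."""
    return [c * pow(t, top - 2 * i, P) % P for i, c in enumerate(coeffs)]


def is_valid(U, V, f):
    return all(c == 0 for c in poly_rem(poly_sub(poly_mul(V, V), f), U))


U = [5, 7, 1]               # t^2 + 7t + 5
V = [3, 11]                 # 11t + 3
W = [2, 9, (-7) % P, 1]     # chosen so that the x^4 coefficient of f vanishes
f = poly_add(poly_mul(U, W), poly_mul(V, V))

s = 123
s_inv = pow(s, P - 2, P)
f_t = scale(f, s_inv, 2 * (2 * G + 1))
U_t = scale(U, s_inv, 2 * G)
V_t = scale(V, s_inv, 2 * G + 1)

if __name__ == "__main__":
    print(is_valid(U, V, f), is_valid(U_t, V_t, f_t))
```

Calling `scale` with `s` instead of `s_inv` undoes the transformation, which corresponds to pulling the result back with the inverse map.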



We mention an implementation trick to save an inversion each time the device
is used at the price of a multiplication. During the initialisation of the
device, a set $(\kappa_i, \kappa_i^{-1})$ of randomly chosen elements of
$\mathbb{F}_q^\times$ together with their inverses is stored in the E²PROM.
Before each cryptographic operation, two random indices $i \neq j$ are picked,
and the $i$-th pair is replaced by
$(\kappa_i \cdot \kappa_j,\ \kappa_i^{-1} \cdot \kappa_j^{-1})$. The result is
used as the $(s, s^{-1})$ for the curve randomisation in the current session.
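
A minimal sketch of this pool of precomputed pairs is given below. It assumes a prime field with modulus `p` and uses Python's `secrets` module as a stand-in for the device's random number generator; the class name `InversePool` and its methods are hypothetical. Inversions happen only once, at initialisation, and each later request costs multiplications only.

```python
import secrets


class InversePool:
    """Pool of (k, k^-1) pairs modulo a prime p, refreshed by multiplication."""

    def __init__(self, p: int, size: int = 4):
        if size < 2:
            raise ValueError("at least two pairs are needed")
        self.p = p
        self.pairs = []
        for _ in range(size):
            k = secrets.randbelow(p - 1) + 1            # uniform in 1..p-1
            self.pairs.append((k, pow(k, p - 2, p)))    # inversion, done once

    def next_pair(self):
        """Return a fresh (s, s^-1) without computing any inversion."""
        i = secrets.randbelow(len(self.pairs))
        j = secrets.randbelow(len(self.pairs) - 1)
        if j >= i:
            j += 1                                      # guarantees j != i
        ki, ki_inv = self.pairs[i]
        kj, kj_inv = self.pairs[j]
        self.pairs[i] = (ki * kj % self.p, ki_inv * kj_inv % self.p)
        return self.pairs[i]


if __name__ == "__main__":
    pool = InversePool(1019)
    print(pool.next_pair())
```

The invariant that makes the trick work is that the product of the two components of every stored pair stays equal to 1.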



Partial conclusions. Curve randomisation in odd characteristic is a fast coun- 
termeasure. The total amount of operations required to apply this technique is 
either comparable with that of a single group operation or much smaller. 



Even characteristic. The discussion in §2.1 applies in particular to the case
of even characteristic. Let $d = [\mathbb{F}_q : \mathbb{F}_2]$. Since in this
case one must have $h(x)\tilde{h}(x) \neq 0$ in equations (2) and (3), it is
clear that applying the isomorphisms in general will not be as efficient as in
the odd characteristic case.

In place of the fully general isomorphisms (4) we assume $b = 0$ and
$A(x) = 0$, and proceed as at the end of §2.1. The isomorphisms of the form

$$\phi : (x, y) \mapsto (s^{-2}x,\ s^{-(2g+1)}y) \tag{12'}$$

for generic $s \in \mathbb{F}_{2^d} \setminus \mathbb{F}_2$ randomise the
coefficients of the equation as follows

$$\tilde{h}(x) = s^{-(2g+1)}h(s^2x), \qquad \tilde{f}(x) = s^{-2(2g+1)}f(s^2x). \tag{13}$$

As in §2.1 we make this explicit: if

$$f(x) = \sum_i f_i x^i \quad\text{and}\quad h(x) = \sum_i h_i x^i$$

then

$$\tilde{f}(x) = \sum_i s^{2i - 2(2g+1)} f_i x^i \quad\text{and}\quad
\tilde{h}(x) = \sum_i s^{2i - (2g+1)} h_i x^i$$

and the formulae for $\tilde{U}, \tilde{V}$ are the same as in §2.1. All the
coefficients of the equation and of the divisor are then multiplied by random
constants. In even characteristic we must compute also the coefficients of
$\tilde{h}(x)$ from those of $h(x)$. Hence, at most $g + 1$ field operations
more are required than in the odd characteristic case, bringing the cost of
the computation of $\phi$ to at most $8g + 2$ multiplications, after $s$ has
been randomly chosen and $s^{-1}$ computed. The computation of $\phi^{-1}$
still requires $4g$ multiplications. The total cost of this randomisation is
thus $12g + 2$ field multiplications and one inversion. The implementation
trick described in §2.1 is not necessary in even characteristic, inversion
being much faster in this case.

Restricting h: h constant. In even characteristic often the coefficients of
$h(x)$ are restricted for performance reasons. In this paragraph we consider
the case where $h(x)$ is a non-zero constant. Equation (6) implies that
$\tilde{h}(x)$ will also be a non-zero constant.

It is an established fact in algebraic geometry that curves of equation
$y^2 + cy = f(x)$ with $\deg f = 5$ and $c \neq 0$ are supersingular
[11, Theorem 9] and so are not suitable for the cryptographic applications we
have in mind.

On the other hand there are no hyperelliptic supersingular curves of genus 3
in characteristic 2 [39], so curves of the form $y^2 + cy = f(x)$ where
$\deg f = 7$ and $c \neq 0$ do not appear to be weak provided that parameters
such as extension degree and group order are suitably chosen. Now, even though
in [38] a very fast doubling formula is given for the case $h(x) = 1$,
J. Pelzl has privately communicated to us that in the generic case where
$h(x)$ is a non-zero constant $h(x) = c \in \mathbb{F}_{2^d}^\times$ doubling
speed can still be improved dramatically. Trivially,
$\tilde{h}(x) = s^{-(2g+1)}c = s^{-7}c$. This makes the genus 3 case
important.

Restricting h: h non-constant but defined over $\mathbb{F}_2$. Another
technique for gaining performance is to choose $h(x)$ non-constant but defined
over $\mathbb{F}_2$ (see for example [25] and [26]). By (6) this leads to the
question: if $h(x) \in \mathbb{F}_2[x]$, for which elements
$b \in \mathbb{F}_q$ and $s \in \mathbb{F}_q^\times$ is
$\tilde{h}(x) = s^{-(2g+1)}h(s^2(x - b)) \in \mathbb{F}_2[x]$?

The leading coefficient of $\tilde{h}(x)$ equals $s^{-r}$ where
$r = (2g + 1) - 2\deg h$, and since it cannot vanish, it is 1, i.e.
$s^r = 1$. Now $r$ is an odd positive integer bounded by $2g - 1$. The
cryptosystem must withstand P. Gaudry's low genus algorithm for computing
discrete logarithms in hyperelliptic Jacobians [12]. Hence $g$ must be small,
in fact $g \leq 4$, so $r \leq 7$. This implies that $s$ can take only very
few possible values, making superfluous the effort of randomising it.

Remark: In order to make Weil Descent attacks [9,10] infeasible, the extension
degree $d$ is usually taken to be a large prime number $p \geq 160/g$ or twice
a prime $p \geq 80/g$. Recall also that $g \leq 4$. The possible values of $s$
are limited to the roots of irreducible factors of $X^r - 1$ of degree
dividing $d$. If $d = p \geq 160/g \geq 40$, which is also the preferred case,
then $s = 1$. If $d = 2p$ with $p \geq 80/g \geq 20$, $s$ can only be a root
of a factor of $X^r - 1$ of degree at most 2 and irreducible over
$\mathbb{F}_2$. A quick verification of such factors (recall that $r$ is odd
and $\leq 7$) implies that either $s = 1$ or $r = 3$ and $s^2 + s + 1 = 0$. If
two coefficients of $h(x)$ are equal to 1, forcing the corresponding
coefficients of $\tilde{h}(x)$ to be also equal to 1 implies always $s = 1$.
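
The remark can be checked numerically with a standard fact that is not stated in the text but is elementary: the multiplicative group of the field with 2^d elements is cyclic of order 2^d - 1, so the equation s^r = 1 has exactly gcd(r, 2^d - 1) solutions in it. The sketch below (the function name is hypothetical) counts the admissible values of s for odd r up to 7.

```python
from math import gcd


def admissible_s_count(r: int, d: int) -> int:
    """Number of s in GF(2^d)^* with s^r = 1 (the group is cyclic of order 2^d - 1)."""
    return gcd(r, (1 << d) - 1)


if __name__ == "__main__":
    for d in (41, 43, 47, 2 * 23, 2 * 29):
        print(d, {r: admissible_s_count(r, d) for r in (1, 3, 5, 7)})
```

For a prime extension degree above 40 every count is 1, so only s = 1 remains. For d = 2p the only non-trivial case is r = 3, where the two extra solutions are the roots of s^2 + s + 1, in agreement with the remark.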

Let $\sigma$ be the Frobenius automorphism of $\mathbb{F}_q/\mathbb{F}_2$,
i.e. $\sigma : a \mapsto a^2$. Now
$\tilde{h}(x) = h(x - b) \in \mathbb{F}_2[x]$, hence
$h(-b^{\sigma^j}) = h(-b)^{\sigma^j} = h(-b) \in \mathbb{F}_2$ for all $j$. In
other words all distinct conjugates of $-b$ are roots of
$h(x) - h(-b) = 0$, and if $b \notin \mathbb{F}_2$ there are at least
$p \geq 80/g \geq 20$ of such conjugates, including $-b$. But the degree of
$h$, as we already know, is at most $g \leq 4$, and this forces
$b \in \mathbb{F}_2$. There are only two choices for $b$, making it useless to
consider its randomisation.

We see that the isomorphisms we can use are of the form

$$\phi : (x, y) \mapsto (x,\ y + A(x))$$

where the polynomial $A(x) \in \mathbb{F}_q[x]$ has degree at most $g$. The
situation is similar to that for elliptic curves as described in the already
cited paper of M. Joye and C. Tymen: we can efficiently randomise only one of
the two polynomials ($V$, whereas $U$ will be left untouched), or, in other
words, only a half of the coordinates. In fact, by (6) not all coefficients of
$f$ are randomised in $\tilde{f}$, increasing the likelihood of successful
bit-correlations if this countermeasure is used alone.

Partial conclusions. We conclude that for genus 2 hyperelliptic curves in
characteristic 2, curve randomisation is not adequate if one wants to force
the coefficients of $h$ to lie outside $\mathbb{F}_2$.

In the genus 3 case curves of equation $y^2 + cy = f(x)$ can be randomised
obtaining good performance and security.

In all other cases, we recommend other techniques, such as divisor randomi- 
sation, which also works in odd characteristic. We sketch it in the next section 
in the case of genus 2. 

2.2 Divisor Randomisation in Genus 2 

Divisor randomisation works by randomising the bits of the representation of a
reduced divisor, which can be either the base group element of the
cryptosystem or any intermediate result of the computation of a scalar
multiplication. This technique does not scramble the bits of the internal
representations of the coefficients of the curve. It can be used whenever a
group element can be represented in several different ways. Notable examples
are the projective coordinates on elliptic curves: two triples $(X, Y, Z)$ and
$(X', Y', Z')$ represent the same point if and only if there exists a non-zero
element $s$ in the base field such that $X = sX'$, $Y = sY'$ and $Z = sZ'$.
With Jacobian coordinates [5], two triples $(X, Y, Z)$ and $(X', Y', Z')$
represent the same point if and only if $X = s^2X'$, $Y = s^3Y'$ and
$Z = sZ'$.

Recently, alternative coordinate systems for genus 2 hyperelliptic curves have
been proposed: an inversion-free system by Miyamoto et al. [32] which operates
on the hyperelliptic analogue of projective coordinates, later extended and
improved by Lange [25], who also developed an analogue of Jacobian
coordinates, called the new (or weighted) coordinates [26]. We are not aware
of similar coordinate systems for genus 3 curves. Furthermore, as the genus of
the considered curve increases, the size of the base field decreases, and the
cost of a field inversion relative to a field multiplication also decreases
quickly. This makes inversion-free formulae in genus at least 3 not so
desirable from the point of view of raw performance, because they trade a
single inversion for a lot more multiplications than the affine formulae.

In projective coordinates a divisor class $D$ with associated reduced
polynomial pair $[U(t), V(t)]$ is represented as a quintuple $[U_1, U_0, V_1, V_0, Z]$ where

$$U(t) = t^2 + \frac{U_1}{Z}\,t + \frac{U_0}{Z} \quad\text{and}\quad V(t) = \frac{V_1}{Z}\,t + \frac{V_0}{Z}.$$

The randomisation in this case consists in picking a random $s \in \mathbb{F}_q^{*}$ and
performing the following replacement:

$$[U_1, U_0, V_1, V_0, Z] \mapsto [sU_1, sU_0, sV_1, sV_0, sZ].$$
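As an illustration of the projective case, here is a minimal Python sketch. It assumes that the quintuple encodes the affine coefficients as $U_1/Z$, $U_0/Z$, $V_1/Z$, $V_0/Z$, and it uses a toy prime field; the modulus and the function names are hypothetical.

```python
import secrets

Q = 1009  # hypothetical small prime standing in for the field size q

def blind_projective(div, q=Q):
    """Multiply all five coordinates [U1, U0, V1, V0, Z] by a fresh non-zero s."""
    s = 1 + secrets.randbelow(q - 1)
    return tuple(s * c % q for c in div)

def affine_coefficients(div, q=Q):
    """Recover (U1/Z, U0/Z, V1/Z, V0/Z), i.e. the reduced polynomial pair."""
    U1, U0, V1, V0, Z = div
    zi = pow(Z, -1, q)
    return tuple(c * zi % q for c in (U1, U0, V1, V0))
```

The cost is five field multiplications plus the generation of `s`, and the divisor class represented by the quintuple is unchanged.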



In new (weighted) coordinates a divisor class is represented by means of six
coordinates $[U_1, U_0, V_1, V_0, Z_1, Z_2]$ where

$$U(t) = t^2 + \frac{U_1}{Z_1^2}\,t + \frac{U_0}{Z_1^2} \quad\text{and}\quad V(t) = \frac{V_1}{Z_1^3 Z_2}\,t + \frac{V_0}{Z_1^3 Z_2}.$$

To blind the base point or an intermediate computation, two elements $s_1, s_2$ are
picked in $\mathbb{F}_q^{*}$ at random and the following substitution is performed:

$$[U_1, U_0, V_1, V_0, Z_1, Z_2] \mapsto [s_1^2 U_1, s_1^2 U_0, s_1^3 s_2 V_1, s_1^3 s_2 V_0, s_1 Z_1, s_2 Z_2].$$

If (some or all of) the additional coordinates $z_1, z_2, z_3$ and $z_4$ are used, which
satisfy $z_1 = Z_1^2$, $z_2 = Z_2^2$, $z_3 = Z_1 Z_2$ and $z_4 = z_1 z_3$, then they must also be
updated: the fastest way is to recompute them from $Z_1$ and $Z_2$ by two squarings
and two multiplications.
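The weighted case can be sketched in the same way. The Python fragment below assumes the weights $Z_1^2$ for the $U$ coefficients and $Z_1^3 Z_2$ for the $V$ coefficients, and the auxiliary values $z_1 = Z_1^2$, $z_2 = Z_2^2$, $z_3 = Z_1 Z_2$, $z_4 = z_1 z_3$; the field modulus and all names are illustrative.

```python
Q = 1009  # hypothetical small prime standing in for the field size q

def blind_weighted(div, s1, s2, q=Q):
    """[U1,U0,V1,V0,Z1,Z2] -> [s1^2 U1, s1^2 U0, s1^3 s2 V1, s1^3 s2 V0, s1 Z1, s2 Z2]."""
    U1, U0, V1, V0, Z1, Z2 = div
    a = s1 * s1 % q            # s1^2
    b = a * s1 % q * s2 % q    # s1^3 * s2
    return (a * U1 % q, a * U0 % q, b * V1 % q, b * V0 % q,
            s1 * Z1 % q, s2 * Z2 % q)

def auxiliary(Z1, Z2, q=Q):
    """Recompute z1..z4 with two squarings and two multiplications."""
    z1 = Z1 * Z1 % q
    z2 = Z2 * Z2 % q
    z3 = Z1 * Z2 % q
    z4 = z1 * z3 % q           # equals Z1^3 * Z2
    return z1, z2, z3, z4

def affine_coefficients(div, q=Q):
    """Recover the coefficients of U(t) and V(t) from weighted coordinates."""
    U1, U0, V1, V0, Z1, Z2 = div
    z1, _, _, z4 = auxiliary(Z1, Z2, q)
    iu, iv = pow(z1, -1, q), pow(z4, -1, q)
    return (U1 * iu % q, U0 * iu % q, V1 * iv % q, V0 * iv % q)
```

In a real implementation `s1` and `s2` would be drawn from a random number generator at each run; they are explicit parameters here so that the invariance can be tested exhaustively.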
2.3 Goubin’s Attack May Force Further Blinding 

Recently L. Goubin [13] has pointed out a potential weakness of some ECC
randomisation procedures, including Coron’s third and Joye-Tymen’s, when
implemented on systems where the secret scalar is fixed and the base of the scalar
multiplication (the message) can be chosen. Since our techniques generalise the
above ones, it is natural to investigate how Goubin’s ideas might affect our work.

His attack is based on the fact that randomising 0 through multiplication by a
constant, or through a field isomorphism, still yields 0. It also relies on the fact
that the scalar multiplication algorithm has a fixed sequence of group operations
for a given scalar, even after removing any dummy operations. (It should also work
if the number of possible operation sequences for a given scalar is small enough.)

Suppose that the most significant bits $n_r, n_{r-1}, \ldots, n_{j+1}$ of the secret scalar
$n$ are known and that we want to discover the next bit $n_j$. Assume also that a
chosen message attack can be set up to obtain in a specific step of the scalar
multiplication (namely the one corresponding to the processing of $n_j$) a point
or a divisor with one or more coordinates equal to zero, provided that $n_j$ has
been guessed correctly (that divisor can be $tD$ where $D$ is the “message” and
$t = (n_r n_{r-1} \ldots n_{j+1} n_j)_2$). The side-channel trace correlation may reveal if the
guess was correct even in presence of multiplicative randomisation procedures,
because some multiplications by zero will occur in any case. In particular, this
can affect the random isomorphisms of the form (j) : (x,y) i— 
and the divisor randomisation techniques of Subsection 2.2. 

An approach to thwart Goubin’s attack could use the more general
isomorphisms (4) with $b, A(x) \neq 0$ to randomise also the vanishing coefficients of the
divisors: this has the disadvantage of requiring curve equations in general form
and thus slower formulae for the operations.

There is a development of Goubin’s ideas which might be even more serious.
We first fix some notation: $A$ is the large prime order subgroup of $\mathrm{Cl}^0(C)(\mathbb{F}_q)$
used in the cryptosystem and $\ell$ its order.

A variant of Goubin’s attack may exploit the fact that the basic explicit
formulae for small genus hyperelliptic curves only deal with the most common
cases (cf. [24,27,38]). They do not hold if the divisors given as input to a group
operation satisfy exceptional conditions, such as:

(i) If the reduced divisor $D_i$, for $i = 1, 2$, is represented by the polynomial pair
$[U_i, V_i]$, then the greatest common divisor of $U_1$ and $U_2$ is non-constant or,
equivalently, their resultant vanishes.

In this case we say that $D_1$ and $D_2$ collide. This happens if the supports
of $D_1$ and of either $D_2$ or $-D_2$ have at least one point in common.

(ii) $\deg(D_1) < g$ or, equivalently, $\deg(U_1) < g$ (this applies to additions as well
as to doublings).

Such situations occur in practice with very small probability for curves
over $\mathbb{F}_q$, hence no separate formulae for these cases are implemented and either
Cantor’s algorithm or formulae with quite different characteristics are used^.
Since the characterising properties of these divisors are of geometric nature,
they are preserved under curve isomorphisms. Their occurrence may thus be
induced at prescribed points to verify the guesses of the bits of the scalar. At
least in theory, the attacker guesses that at some point the scalar multiplication
algorithm adds $D$ to $tD$ (resp. doubles $tD$) and therefore chooses $D$ to collide
with $tD$ (resp. such that $\deg(tD) < g$). We do not know, except for very simple cases
(i.e. $|t| < g$), how to produce for a given $t$ (with $\ell \nmid t$) a divisor $D$ colliding with
$tD$ (reduced); we suspect that it is in general a hard problem. On the other
hand it is very easy to find $D$ such that $\deg(tD) < g$: just pick any $D' \in A$ with
$\deg(D') < g$, find an integer $s$ with $st \equiv 1 \pmod{\ell}$ and put $D = sD'$. Then
$tD = D'$ and, if the doubling formula for the exceptional case is distinguishable
from the generic one, the attack can be launched.
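The construction of the chosen message only needs a modular inverse. The following Python sketch illustrates the arithmetic under a strong simplification: the prime-order group $A$ is modelled by the integers modulo a small prime `ELL` under addition, so a "divisor" is just an integer. The constant and the function names are hypothetical.

```python
ELL = 1019  # hypothetical prime group order (stands in for the order of A)

def scalar_mul(k, D):
    """k * D in the toy group Z/ELL (a stand-in for divisor scalar multiplication)."""
    return k * D % ELL

def chosen_message(t, D_target):
    """Return D with t * D == D_target, using s = t^{-1} mod ELL."""
    if t % ELL == 0:
        raise ValueError("t must not be a multiple of the group order")
    s = pow(t, -1, ELL)   # modular inverse, Python 3.8+
    return scalar_mul(s, D_target)
```

With `D_target` playing the role of a divisor $D'$ of degree less than $g$, the attacker submits `chosen_message(t, D_target)` so that the intermediate value $tD$ is exactly the exceptional divisor.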

This represents an obvious danger with affine coordinates: if one of the above 
exceptional conditions occurs, the most common case formulae cannot be used, 
to avoid a division by zero. With inversion-free coordinate systems the situation 
is only apparently different: one can just use only the most common case formulae 
and check at the end of the scalar multiplication if the divisor belongs to the 
curve - but also in that case anomalous behaviour of the device at the end of 
the scalar multiplication could be detected. 

We therefore need additional scalar and message blinding methods. 

We briefly discuss scalar blinding methods. Their purpose is to render
unpredictable the addition chain used in the scalar multiplication, thus preventing
the attacker from guessing for which integers $t$ group operations of the type $D + tD$
or $2(tD)$ are actually performed.

The first is Coron’s first countermeasure [7], i.e. the replacement of the scalar
$n$ with $n + k\ell$ in $nD$ for a random integer $k$. This technique can be traced back to
[20], and was shown [36] to leave a bias in the least significant bits of the scalar.
B. Möller [34] combines it (only in the ECC case) with an idea of C. Clavier and
M. Joye, and suggests the computation of $nD = (n + k_1 + k_2\ell)D - k_1D$, where
$k_1$ and $k_2$ are two suitably sized random numbers: $k_1$ and $k_2$ should be large
enough to make L. Goubin’s attack unattractive, yet not too big, to leave the
overhead tolerable (for example fci, ^2 ~ 2^^ are good choices if £ « 

For a completely different technique see [41]. 
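Both scalar blinding identities can be verified in any group of prime order $\ell$. The Python sketch below is an assumption-laden illustration: it replaces the divisor class group by the order-1019 subgroup of squares modulo the prime 2039, written multiplicatively, so that "$nD$" becomes `pow(D, n, P_MOD)`. The parameters, the 20-bit size of the random values and the function names are all hypothetical.

```python
import secrets

P_MOD, ELL, G = 2039, 1019, 4   # toy parameters: G generates the subgroup of order ELL

def add(a, b):
    """Group law of the toy group (multiplication modulo P_MOD)."""
    return a * b % P_MOD

def smul(k, a):
    """'k times a' in additive notation; a negative k inverts (Python 3.8+)."""
    return pow(a, k, P_MOD)

def coron_blinded(n, D, bits=20):
    """Compute nD as (n + k*ELL) D for a random k."""
    k = secrets.randbits(bits)
    return smul(n + k * ELL, D)

def moller_blinded(n, D, bits=20):
    """Compute nD as (n + k1 + k2*ELL) D - k1 D for random k1, k2."""
    k1, k2 = secrets.randbits(bits), secrets.randbits(bits)
    return add(smul(n + k1 + k2 * ELL, D), smul(-k1, D))
```

Each call walks through a different addition chain, yet returns the same group element, because the exponent is only changed by a multiple of the group order (and, in the second variant, by a term that is subtracted again).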

For message blinding, a HECC analogue of Coron’s second method [7] consists
in replacing the computation of $nD$ with that of $n(D + R) - S$, where $R \in A$
is a secret divisor for which $S = nR$ is known. A set of secret divisor pairs
$(R_i, S_i) \in A \times A$ with $S_i = nR_i$ can be stored in the smart card at initialisation
time, and at each run both elements of a randomly chosen pair are multiplied
by the same small signed scalar and added to the respective elements of another
pair. The result is then used to randomise the scalar multiplication.
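The pair-update rule keeps the relation $S_i = nR_i$ intact, which is what makes the unblinding step work. A minimal Python sketch follows, again assuming a toy prime-order group (the squares modulo 2039, written multiplicatively) in place of the divisor class group; the range of the small signed scalar, the number of stored pairs and all names are illustrative choices.

```python
import secrets

P_MOD, ELL, G = 2039, 1019, 4   # toy parameters: G generates the subgroup of order ELL

def add(a, b):
    return a * b % P_MOD

def smul(k, a):
    """'k times a' in additive notation; a negative k inverts (Python 3.8+)."""
    return pow(a, k, P_MOD)

def make_pairs(n, count=4):
    """Initialisation: random R_i together with S_i = n R_i."""
    pairs = []
    for _ in range(count):
        R = smul(1 + secrets.randbelow(ELL - 1), G)
        pairs.append((R, smul(n, R)))
    return pairs

def refresh(pairs):
    """Add c * (R_i, S_i) to another pair (R_j, S_j) for a small signed c."""
    i = secrets.randbelow(len(pairs))
    j = (i + 1 + secrets.randbelow(len(pairs) - 1)) % len(pairs)  # j != i
    c = secrets.choice([-3, -2, -1, 1, 2, 3])
    (Ri, Si), (Rj, Sj) = pairs[i], pairs[j]
    pairs[j] = (add(Rj, smul(c, Ri)), add(Sj, smul(c, Si)))
    return pairs[j]

def blinded_mul(n, D, pairs):
    """Compute nD as n(D + R) - S with a freshly updated pair (R, S)."""
    R, S = refresh(pairs)
    return add(smul(n, add(D, R)), smul(-1, S))
```

Only a few group operations are needed per run to refresh the pair, which is why this form of message blinding is cheaper than lengthening the scalar.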

^ To provide explicit and indistinguishable formulae for all cases would be a formidable 
feat - and would probably slow down considerably the cryptosystem. 
Suppose that a computation involving $tD$ has to be done during the scalar
multiplication, either $D + tD$ or $2(tD)$, and that either $D$ collides with $tD$ (this
is relevant only to the addition $D + tD$) or $\deg(tD) < g$, as wished by the
attacker. If $D$ has been replaced at the beginning by $D + R$ for a randomly chosen
point $R$, then $D + R$ and $t(D + R)$ will collide with probability $O(q^{-1})$ (this is
actually a conjecture which has been extensively confirmed experimentally on
small curves), resp. $\deg(t(D + R)) < g$ also with probability $O(q^{-1})$. The last
statement holds because $t(D + R)$ is in practice a random point, which implies
also that, even if $tD$ had some zero coordinates, a fixed coordinate of $t(D + R)$
would be zero with probability $q^{-1}$.

We infer that this type of message blinding (which, if used alone, might arouse
suspicion) thwarts Goubin’s attack. Due to a similar underlying philosophy,
additional message blinding should be effective also against some HECC analogue of
the “exceptional procedure attack” [14].

To prevent a variant of Goubin’s attack, we recommend using at least one
additional scalar or message blinding method besides our randomisation
procedures. The hyperelliptic analogue of Coron’s second countermeasure, being less
expensive than scalar blinding, looks particularly attractive. The isomorphism $\phi$
need not be of the most general type described in 2.1, but the conclusions and
the caveats of 2.1 still apply.



3 Conclusions 

We proposed two methods to blind the base divisor class for hyperelliptic curve 
cryptosystems, in order to provide resistance against DPA. 

The first method consists in transferring the critical computation to the 
Jacobian of a different randomly chosen isomorphic curve. It can be applied 
to curves of all genera. 

The second method is a hyperelliptic analogue of Coron’s third countermea- 
sure. It applies only to families of curves for which we know explicit formulae 
for hyperelliptic analogues of elliptic curve projective and Jacobian coordinates. 
Explicit examples in the case of genus 2 have been worked out in detail. 

These techniques are easy to implement and do not impact the performance 
significantly. In fact their cost is at most that of a single group addition. 

In conjunction with suitable additional scalar and message blinding tech- 
niques, they can be made resistant against Goubin’s recent chosen message at- 
tack, as well as against a possibly more serious variant of the latter based on the 
structure of the divisors on hyperelliptic curves. 
Appendix: Explicit Transformations for the Curve 
Randomisation for Genus 2 

As an example, in this appendix we write down the transformations for the
curve randomisation method explained in Subsection 2.1, for $g = 2$. In view of
the results of §2.1, we consider here only the curve isomorphism of type (12)
where the equations of $C$ and $\tilde{C}$ are given by

$$C : y^2 + h(x)y - f(x) = 0 \quad\text{and}\quad \tilde{C} : y^2 + \tilde{h}(x)y - \tilde{f}(x) = 0.$$

The polynomials $f(x)$ and $h(x)$ are of the form

$$f(x) = x^5 + f_4x^4 + f_3x^3 + f_2x^2 + f_1x + f_0 \quad\text{and}\quad h(x) = h_2x^2 + h_1x + h_0.$$

Their images are

$$\tilde{f}(x) = x^5 + s^{-2}f_4x^4 + s^{-4}f_3x^3 + s^{-6}f_2x^2 + s^{-8}f_1x + s^{-10}f_0 \quad\text{and}\quad \tilde{h}(x) = s^{-1}h_2x^2 + s^{-3}h_1x + s^{-5}h_0.$$

If the base divisor is given by $D = [U(t), V(t)]$ with $\deg(U) = 2$,

$$U(t) = t^2 + U_1t + U_0 \quad\text{and}\quad V(t) = V_1t + V_0,$$

then its image $[\tilde{U}(t), \tilde{V}(t)]$ is

$$\tilde{U}(t) = t^2 + s^{-2}U_1t + s^{-4}U_0 \quad\text{and}\quad \tilde{V}(t) = s^{-3}V_1t + s^{-5}V_0.$$

If $\deg(U) = 1$, i.e. $U(t) = t + U_0$, then its image is $\tilde{U}(t) = t + s^{-2}U_0$, whereas
the image of $V(t)$ is independent of the degree: in this case $V(t) = V_0$ and thus
$\tilde{V}(t) = s^{-5}V_0$. The inverse transformation from $\phi(D)$ to $D$ is obvious.

The total number of field operations is at most 26 multiplications in the even 
characteristic case, 23 multiplications in odd characteristic (because h = 0), and 
one inversion (but: see end of 2.1). 
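The transformations of this appendix can be sanity-checked numerically. The Python sketch below assumes that the isomorphism acts on points as $(x, y) \mapsto (s^{-2}x, s^{-5}y)$, which is the map consistent with scaling the coefficient of $x^i$ in $f$ by $s^{-(10-2i)}$ and in $h$ by $s^{-(5-2i)}$. The prime 1009, the sample coefficients and the function names are illustrative; a brute-force point search is used only because the field is tiny.

```python
P = 1009  # hypothetical small prime; the paper works over a general finite field

def poly(coeffs, x, p=P):
    """Evaluate sum(coeffs[i] * x^i) mod p."""
    return sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p

def on_curve(x, y, f, h, p=P):
    """Check y^2 + h(x)y - f(x) = 0, with f monic of degree 5 (f = [f0..f4])."""
    fx = (pow(x, 5, p) + poly(f, x, p)) % p
    return (y * y + poly(h, x, p) * y - fx) % p == 0

def transform_curve(f, h, s, p=P):
    """Scale f_i by s^-(10-2i) and h_i by s^-(5-2i)."""
    si = pow(s, -1, p)
    ft = [c * pow(si, 10 - 2 * i, p) % p for i, c in enumerate(f)]
    ht = [c * pow(si, 5 - 2 * i, p) % p for i, c in enumerate(h)]
    return ft, ht

def transform_point(pt, s, p=P):
    si = pow(s, -1, p)
    return (pt[0] * pow(si, 2, p) % p, pt[1] * pow(si, 5, p) % p)

def transform_divisor(div, s, p=P):
    """[U1, U0, V1, V0] -> [s^-2 U1, s^-4 U0, s^-3 V1, s^-5 V0]."""
    U1, U0, V1, V0 = div
    si = pow(s, -1, p)
    return (U1 * pow(si, 2, p) % p, U0 * pow(si, 4, p) % p,
            V1 * pow(si, 3, p) % p, V0 * pow(si, 5, p) % p)

def mumford(p1, p2, p=P):
    """Polynomial pair of the divisor supported on two points with distinct x."""
    (x1, y1), (x2, y2) = p1, p2
    V1 = (y2 - y1) * pow(x2 - x1, -1, p) % p
    return (-(x1 + x2) % p, x1 * x2 % p, V1, (y1 - V1 * x1) % p)

def find_points(f, h, count=2, p=P):
    """Brute-force search: at most one point per x, in increasing order of x."""
    pts = []
    for x in range(p):
        for y in range(p):
            if on_curve(x, y, f, h, p):
                pts.append((x, y))
                break
        if len(pts) == count:
            break
    return pts
```

Mapping two points of $C$ and rebuilding the divisor on $\tilde{C}$ gives the same result as transforming the coefficients of $U$ and $V$ directly, which is the consistency the formulae above express.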




A Practical Countermeasure against Address-Bit 
Differential Power Analysis 



Kouichi Itoh^, Tetsuya Izu^, and Masahiko Takenaka^ 



^ FUJITSU LABORATORIES Ltd., 

64, Nishiwaki, Okubo-cho, Akashi, 674-8555, Japan 
^ FUJITSU LABORATORIES Ltd., 

4-1-1, Kamikodanaka, Nakahara-ku, Kawasaki, 211-8588, Japan 
{kito, izu, takenaka}@labs.fujitsu.com 



Abstract. The differential power analysis (DPA) enables an adversary 
to reveal the secret key hidden in a smart card by observing power 
consumption. The address-bit DPA is a typical example of DPA 
which analyzes a correlation between addresses of registers and power 
consumption. In this paper, we propose a practical countermeasure, 
the randomized addressing countermeasure, against the address-bit 
DPA which can be applied to the exponentiation part in RSA or 
ECC with and without pre-computed table. Our countermeasure 
has almost no overhead for the protection, namely the processing 
speed is no slower than that without the countermeasure. We also 
report experimental results of the countermeasure in order to show its 
effect. Finally, a complete comparison of countermeasures from various 
view points including the processing speed and the security level is given. 

Keywords. Differential Power Analysis (DPA), address-bit DPA, coun- 
termeasure, exponentiation, RSA, ECC 



1 Introduction 

Smart cards are becoming a new infrastructure of the coming IT society thanks to
their many attractive applications such as identification cards, telephone cards and
electronic tickets. However, side channel attacks are real threats to them
[19,20]. In such an attack, an adversary observes side channel information such as
computing time and power consumption. The adversary can obtain the secret
information if there is a tight relation between the side channel information and
the secret information (secret key) hidden in the smart card. In particular, if there
is an irregular procedure (a short-cut) in the computation, the adversary can easily
detect it. Thus the adversary could reveal the secret key without physically
tampering with the device. The simple power analysis (SPA) and the differential power
analysis (DPA) are typical examples of side channel attacks. Implementers
of cryptographic schemes should take countermeasures against these attacks.

In 1999, Messerges et al. proposed a powerful new attack against secret
key cryptosystems, the address-bit DPA (ADPA) (from now on, we call the
previous DPA the data-bit DPA (DDPA)), which analyzes a correlation between
the secret information and addresses of registers [26]. Then, in 2002, Itoh et al.
extended the attack to Elliptic Curve based Cryptosystems (ECC) [11]. These
results suggest that implementers should consider the correlation of the secret
information not only to data of registers but also to addresses of registers. Itoh et al.
also gave several countermeasures against the attack, but those countermeasures
require at least twice the computing time of an unprotected implementation.

In this paper, we propose a practical countermeasure against the address-bit
DPA by randomizing addresses of registers for ECC and RSA. Our
countermeasure does not change the scalar to be multiplied; the overhead is very small
and the processing speed is as fast as before, namely, a scheme resistant against
the data-bit DPA can be easily converted to one resistant against the address-bit DPA
with almost no penalty. The conversion can be applied not only to binary methods
but also to window-based methods. Moreover, we present a concrete security
evaluation of our countermeasure, both theoretical and experimental.

The approach of our countermeasure is similar to Random Register Renaming
(RRR), a DPA countermeasure proposed by May et al. [27]. RRR is supposed
to be implemented on a processor called NDISC, which can execute instructions
in parallel, while our countermeasure does not require special hardware because
it can be implemented in software alone with very simple program code.

Side channel attacks are so powerful that numerous countermeasures have
been proposed (a brief overview is in [34]). However, some of them have been proved
or shown insecure by newer attacks. Finding a good countermeasure, which
satisfies a certain security level at an acceptable processing speed, is
becoming a hard task for implementers. In this paper, we give a complete
comparison of countermeasures from several viewpoints including the security level
and the processing speed. As a result, our proposed method (combined with
other countermeasures) can provide a good practical solution for resisting
side channel attacks.

In the following, we mainly deal with scalar exponentiations in ECC;
however, most of the algorithms, especially the proposed countermeasures, can be applied to
other exponentiation based cryptosystems such as RSA. The rest of this paper is
organized as follows: In section 2, we give a brief overview of side channel attacks
and countermeasures. Then section 3 describes our proposed countermeasure and
experimental results. A comparison of countermeasures is in section 4.

2 Preliminaries 

In this section, we give a brief introduction to Elliptic Curve based
Cryptosystems (ECC) and side channel attacks (SCA) against them (however, most attacks
can be applied to other exponentiation based cryptosystems such as RSA).



2.1 Elliptic Curve Based Cryptosystems (ECC) 

Elliptic curve based cryptosystems (ECC) are one of the standard technologies
in the area of cryptography [10,28,37]. The main advantage of ECC is the key
length; it is currently chosen much shorter than those of other existing
cryptosystems (RSA and ElGamal). This feature is quite suitable for implementation
on smart cards.

Let $K$ be a finite field whose number of elements is a power of a prime. An
elliptic curve over $K$ can be represented by the Weierstrass equation

$$E(K) := \{(x, y) \in K \times K \mid y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6\} \cup \{O\}, \qquad (1)$$

where $a_i \in K$. The special point $O$ is called the point at infinity. An elliptic curve
$E(K)$ has an additive group structure, in which the neutral element is the point
at infinity. We call $P_1 + P_2$ ($P_1 \neq P_2$) the elliptic curve addition (ECADD) and
$P_1 + P_2$ ($P_1 = P_2$), that is $2P_1$, the elliptic curve doubling (ECDBL). Let $d$ be an
integer and $P$ be a point on the elliptic curve $E(K)$. A scalar exponentiation is to
compute the point $dP = P + \cdots + P$ ($d - 1$ additions). The dominant computation
of all encryption/decryption and signature generation/verification algorithms of
ECC is the computation of $dP$, where $d$ is a secret integer and $P$ is a base point.
Much research has been dedicated to improving the computing time of this
part (see [8] for a survey) by omitting unnecessary computations or by finding
various short-cuts.

Let $d = d_{n-1}2^{n-1} + \cdots + d_1 2^1 + d_0$ be the binary expression of $d$ with
$d_{n-1} = 1$. Then the binary method (from the most significant bit, MSB) for a scalar
exponentiation is in Alg. 1. A similar method from the least significant bit (LSB)
is easily constructed, but we omit this case in this paper for simplicity.



INPUT: d[], P
OUTPUT: dP
1: T[0] = P
2: for i=n-2 downto 0 {
3:   T[0] = ECDBL(T[0])
4:   if (d[i]==1) {
5:     T[0] = ECADD(T[0],P)
6:   }
7: }
8: return T[0]

Alg. 1. Binary method (from MSB)

INPUT: d[], P
OUTPUT: dP
1: T[0] = P
2: for i=n-2 downto 0 {
3:   T[0] = ECDBL(T[0])
4:   T[1] = ECADD(T[0],P)
5:   T[0] = T[d[i]]
6: }
7: return T[0]

Alg. 2. Add-and-double-always method (from MSB)
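Alg. 1 and Alg. 2 can be exercised with a short Python sketch. To keep it self-contained, elliptic curve points are replaced by integers modulo a small prime `ELL` under addition, so `ecadd` and `ecdbl` are stand-ins and not real curve arithmetic; the constant and the helper names are hypothetical.

```python
ELL = 1019  # hypothetical prime group order; integers mod ELL stand in for points

def ecadd(a, b):
    return (a + b) % ELL

def ecdbl(a):
    return (2 * a) % ELL

def bits(d):
    """Bit list d[0..n-1] (least significant first) with d[n-1] = 1; needs d >= 1."""
    return [(d >> i) & 1 for i in range(d.bit_length())]

def binary_method(d, P):
    """Alg. 1: ECADD is executed only when the key bit is 1."""
    db = bits(d)
    T = [P]
    for i in range(len(db) - 2, -1, -1):
        T[0] = ecdbl(T[0])
        if db[i] == 1:            # key-dependent branch, visible to SPA
            T[0] = ecadd(T[0], P)
    return T[0]

def add_and_double_always(d, P):
    """Alg. 2: ECDBL and ECADD are executed for every bit."""
    db = bits(d)
    T = [P, None]
    for i in range(len(db) - 2, -1, -1):
        T[0] = ecdbl(T[0])
        T[1] = ecadd(T[0], P)     # always computed, sometimes discarded
        T[0] = T[db[i]]           # the register index still depends on the key bit
    return T[0]
```

Both functions return the same value; the difference lies only in the sequence of operations, which is fixed in the second one.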



2.2 Side Channel Attack 

The side channel attacks (SCA) are powerful attacks against implementations
of cryptographic schemes on smart cards. An adversary observes side channel
information such as computing time or power consumption of the device. Then
he/she tries to reveal the secret information by analyzing the side channel
information. The simple power analysis (SPA) [19] and the differential power analysis
(DPA) [20] are typical examples of SCA. SPA uses only a single observation,
while DPA uses many observations together with statistical
tools.
Simple Power Analysis. The binary method of Alg. 1 computes ECADD only
when the bit of the secret key is 1. Therefore an adversary easily detects this
irregular procedure by observing side channel information and obtains the bit
information of $d_i$. This is the basic idea of the simple power analysis (SPA) [19].

There are three approaches to resist SPA. The first one uses a special
addition formula, in which ECDBL and ECADD are computed by the same order of
operations (the indistinguishable addition formula [6]). Brier-Joye proposed such a
formula for the Weierstrass form [3], but a security hole was pointed out in
[14]. The second one uses the so-called add-and-double-always method [5] (Alg.
2 for example), in which both ECDBL and ECADD are computed for every bit
(in step 3 and step 4), and the pattern of the side channel information is fixed.
Thus the adversary cannot obtain the bit information of $d$ by SPA.



INPUT: d[], P
OUTPUT: dP
1: T[0] = P, T[1] = ECDBL(P)
2: for i=n-2 downto 0 {
3:   T[2] = ECDBL(T[d[i]])
4:   T[1] = ECADD(T[0],T[1])
5:   T[0] = T[2-d[i]]
6:   T[1] = T[1+d[i]]
7: }
8: return T[0]

Alg. 3. Montgomery ladder

INPUT: d[], P
OUTPUT: dP
1: T[0] = RPC(P)
2: for i=n-2 downto 0 {
3:   T[0] = ECDBL(T[0])
4:   T[1] = ECADD(T[0],P)
5:   T[0] = T[d[i]]
6: }
7: return invRPC(T[0])

Alg. 4. Add-and-double-always method (from MSB) and RPC



The third approach to resist SPA is the Montgomery ladder [24], which essentially
computes ECDBL and ECADD repeatedly [3,7,12,13,18,31,32]. This method can
be viewed as a variant of the add-and-double-always method, but provides a good
processing speed [12,13]. A sample algorithm of the Montgomery ladder is in Alg. 3.
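The register shuffling in Alg. 3 is easier to follow when executed. The Python sketch below transcribes it under the assumption of a toy group (integers modulo a small prime under addition, standing in for curve points); it also checks the ladder invariant that T[1] and T[0] always differ by P. The names and the modulus are illustrative.

```python
ELL = 1019  # hypothetical prime group order; integers mod ELL stand in for points

def ecadd(a, b):
    return (a + b) % ELL

def ecdbl(a):
    return (2 * a) % ELL

def montgomery_ladder(d, P):
    """Alg. 3 with registers T[0], T[1], T[2]; requires d >= 1."""
    db = [(d >> i) & 1 for i in range(d.bit_length())]
    T = [P, ecdbl(P), None]
    for i in range(len(db) - 2, -1, -1):
        T[2] = ecdbl(T[db[i]])
        T[1] = ecadd(T[0], T[1])
        T[0] = T[2 - db[i]]
        T[1] = T[1 + db[i]]
        # Invariant of the ladder (valid in this toy group): T[1] - T[0] = P
        assert (T[1] - T[0]) % ELL == P % ELL
    return T[0]
```

One ECDBL and one ECADD are executed per bit regardless of its value, but the indices `d[i]`, `2-d[i]` and `1+d[i]` still select registers according to the key.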

Differential Power Analysis. Even if a scheme is SPA-resistant, it is not
always secure, because the differential power analysis (DPA) might reveal the
secret key by analyzing observed information statistically [20,25]. In DPA, an
adversary makes an assumption on $d_i$ ($d_i = 0$, for example) and simulates the
computation repeatedly. Then he/she divides the side channel information into
two groups depending on the assumption, so that the Hamming weights of the
internal information are biased between these groups. If the assumption
is correct, then a difference between the information of the two groups (a spike) can be
observed in the trace. Countermeasures against DPA aim to make the simulation
impossible by using random numbers [5]. By randomization we are able to
enhance an SPA-resistant scheme to be DPA-resistant easily. Earlier DPA (the
data-bit DPA (DDPA)) only considered a correlation of data of registers to the
secret information [20,25], while newer DPA (the address-bit DPA (ADPA)) also
considers a correlation of addresses of registers [11,26].

Data-bit DPA: DDPA analyzes a relation between the secret key and data of
registers. For example, after finishing step 5 in Alg. 2, the data of T[0] is the same as
that of T[0] if $d_i = 0$; otherwise it is the same as that of T[1] if $d_i = 1$. Coron
proposed the randomized projective coordinate (RPC) countermeasure [5] in
order to resist DDPA. Let $P = (X : Y : Z)$ be a base point represented in
projective coordinates. Then $(X : Y : Z)$ equals $(rX : rY : rZ)$ for all
$r \in K^*$ mathematically; but they are all different data as bit sequences. The
side channel information of a scalar exponentiation is randomized if $(X : Y : Z)$
is randomized to $(rX : rY : rZ)$. An example of RPC for the add-and-double-always
method (from MSB) is given in Alg. 4, where the function RPC outputs the
randomized point and the function invRPC denotes its inverse map.

Joye-Tymen proposed another countermeasure against DDPA [16], the
randomized curve (RC) countermeasure, which uses an isomorphism of elliptic
curves with which a curve equation and a base point are transformed while
preserving the group structure. Two isomorphic curves are the same mathematically, but
different as bit sequences. Thus the side channel information will be
randomized if the curve and the base point are randomized. A sample algorithm is easily
obtained from Alg. 4 by changing the functions RPC, invRPC to RC, invRC.

Instead of changing the expression of a base point, Coron also proposed
the randomized exponent and the randomized base point countermeasures, in
which the scalar is randomized. However, Okeya-Sakurai showed the bias of
these countermeasures [32]. Messerges et al. proposed the randomized start point
countermeasure, in which the scalar is divided into two parts that are computed by
different methods [25]. Oswald-Aigner proposed another approach to randomize
the scalar [30], but security problems have been pointed out [33,35,39]. Walter proposed
the MIST algorithm [38], which randomizes the intermediate data $T$ (and $U$)
by repeating $T = (d_i \bmod r_i)U + T$, $U = r_iU$ and $d_{i+1} = \lfloor d_i/r_i \rfloor$, where the $r_i$ are
random numbers and the initial values of $T, U, d_i$ are $O, P, d$ respectively. Hasan
proposed an approach that randomizes the scalar representation on Koblitz
curves [9].
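The recurrence quoted for MIST can be traced with a few lines of Python. This is only a sketch of the recurrence as stated above, not Walter's full algorithm: the choice of divisors from {2, 3, 5} is an assumption made for illustration, small multiples are computed directly, and curve points are replaced by integers modulo a small prime under addition. All names are hypothetical.

```python
import secrets

ELL = 1019  # hypothetical prime group order; integers mod ELL stand in for points

def ecadd(a, b):
    return (a + b) % ELL

def small_mul(k, a):
    """k * a for a small k (a stand-in for a short addition chain)."""
    return k * a % ELL

def mist_like(d, P, divisors=(2, 3, 5)):
    """Repeat T = (d mod r)U + T, U = rU, d = floor(d / r) with random r."""
    T, U = 0, P                       # 0 plays the role of the neutral element O
    while d > 0:
        r = secrets.choice(divisors)  # fresh random divisor at every step
        T = ecadd(small_mul(d % r, U), T)
        U = small_mul(r, U)
        d //= r
    return T
```

The quantity `T + d*U` is preserved by every iteration, so the result equals `d*P` whatever sequence of divisors is drawn, while the intermediate values differ from run to run.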

Address-bit DPA: DDPA analyzes a relation between the secret key and data
of registers [20,25]. Messerges et al. proposed the address-bit DPA (ADPA) for
symmetric-key cryptosystems, which analyzes a relation between the key and
addresses of registers [26]. Recently, Itoh et al. extended the analysis to
public-key cryptosystems [11]. For example, step 5 in Alg. 4 performs
T[0] ← T[0] if $d_i = 0$, and T[0] ← T[1] if $d_i = 1$; although these operations
are the same, the substituted data are loaded from different registers and ADPA detects
this relation. They concluded that only the exponent splitting countermeasure
[6], in which the scalar is divided into $d = (d - r) + r$ for a random number $r$,
and the randomized window method [15] are resistant against the attack [11].
As a drawback, however, the required computing time becomes at least twice that
of an implementation without countermeasures. When special hardware is available, RRR [27],
proposed by May et al., resists ADPA.
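The cost of exponent splitting is visible in a small experiment. The Python sketch below assumes a toy group (integers modulo a small prime under addition, in place of curve points) and a plain left-to-right binary method that also counts group operations; the names are illustrative.

```python
import secrets

ELL = 1019  # hypothetical prime group order; integers mod ELL stand in for points

def ecadd(a, b):
    return (a + b) % ELL

def scalar_mul(k, P):
    """Left-to-right binary method; returns (kP, number of group operations)."""
    T, ops = 0, 0
    for bit in bin(k)[2:]:
        T = (2 * T) % ELL
        ops += 1
        if bit == "1":
            T = ecadd(T, P)
            ops += 1
    return T, ops

def split_mul(d, P):
    """Compute dP as (d - r)P + rP for a random r with 0 <= r <= d."""
    r = secrets.randbelow(d + 1)
    A, ops_a = scalar_mul(d - r, P)
    B, ops_b = scalar_mul(r, P)
    return ecadd(A, B), ops_a + ops_b + 1
```

Two scalar multiplications are performed instead of one, which is the source of the doubled computing time mentioned above.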

2.3 Window-Based Method 

In a scalar exponentiation, a pre-computed table sometimes reduces the
computing time if extra registers are available (the window-based method). The
simplest example is in Alg. 5, with window size 4 just for simplicity (where $n$ is
supposed to be a multiple of 4).

INPUT: d[], P
OUTPUT: dP
1: W[0] = O, W[1] = P
2: W[2] = ECDBL(W[1])
3: for i=3 upto 15 {
4:   W[i] = ECADD(W[i-1],W[1])
5: }
6: T = W[d[n-1,n-4]]
7: for i=n-5 downto 0 step -4 {
8:   T = ECDBL(T), T = ECDBL(T)
9:   T = ECDBL(T), T = ECDBL(T)
10:  T = ECADD(T,W[d[i,i-3]])
11: }
12: return T

Alg. 5. 4-bit window method

INPUT: d[], P
OUTPUT: dP
1: W[0] = O, W[1] = RPC(P)
2: W[2] = ECDBL(W[1])
3: for i=3 upto 15 {
4:   W[i] = ECADD(W[i-1],W[1])
5: }
6: T = W[d[n-1,n-4]]
7: for i=n-5 downto 0 step -4 {
8:   T = ECDBL(T), T = ECDBL(T)
9:   T = ECDBL(T), T = ECDBL(T)
10:  T = ECADD(T,W[d[i,i-3]])
11: }
12: return invRPC(T)

Alg. 6. 4-bit window method and RPC

Similar to the binary method (Alg. 1), the window method is also vulnerable
to SPA, because ECADD in step 10 is not computed if W[d[i,i-3]] = O. Möller
proposed a method to construct an addition chain in which W[] is never equal
to O [22,23], which assures the SPA-resistance. One can combine the RPC or RC
countermeasures with Möller’s method (Alg. 6 for the RPC case) in order to resist
DDPA. However, Okeya-Sakurai showed that this is insecure against the
second-order data-bit DPA [34]. Another approach to resist both DDPA and ADPA is
proposed by Itoh et al. [15], which randomizes the window to be added in step
10 in Alg. 6.
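A runnable transcription of the 4-bit window method (Alg. 5) is sketched below in Python. It assumes a toy group in which points are integers modulo a small prime under addition, so the neutral element O is 0; the notation d[i,i-3] is read as the 4-bit value formed by bits i down to i-3. The names and the modulus are hypothetical.

```python
ELL = 1019  # hypothetical prime group order; integers mod ELL stand in for points
O = 0       # neutral element of the toy group

def ecadd(a, b):
    return (a + b) % ELL

def ecdbl(a):
    return (2 * a) % ELL

def window_method(d, P):
    """Alg. 5: table of 0P..15P, then four doublings and one addition per window."""
    n = max(4, -(-d.bit_length() // 4) * 4)   # bit length rounded up to a multiple of 4
    W = [O, P] + [None] * 14
    W[2] = ecdbl(W[1])
    for i in range(3, 16):
        W[i] = ecadd(W[i - 1], W[1])

    def window(i):
        """Value of the bits d[i], ..., d[i-3]."""
        return (d >> (i - 3)) & 0xF

    T = W[window(n - 1)]
    for i in range(n - 5, -1, -4):
        for _ in range(4):
            T = ecdbl(T)
        T = ecadd(T, W[window(i)])            # the table index is key-dependent
    return T
```

In this sketch the addition with `W[0]` is always executed; on a real curve, adding the point at infinity is a special case, which is the SPA leak that Möller's method removes.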



3 Proposed Countermeasure 

In this section, we propose a practical countermeasure, the randomized
addressing method, against the address-bit DPA by randomizing addresses of registers
used in a scalar exponentiation. The overhead of the countermeasure is small and
its effectiveness is shown by the experimental results described in this section.

The approach of our countermeasure is similar to that of Random Register
Renaming (RRR), a hardware-based DPA countermeasure proposed by May et
al. [27]. However, the implementation method is quite different. That is, our
countermeasure can be implemented on various processors because it requires no
special hardware, and can be implemented in software alone with very simple
program code. These properties make our countermeasure very practical.
On the other hand, when program code runs on a processor with
RRR, physical registers are randomly chosen by hardware as long as the
computation result is unchanged. Because execution timing, instruction order and
physical registers are randomly changed, RRR is secure against DPA. However,
they did not show concrete results of a security evaluation.
3.1 Outline 

The address-bit DPA is based on the dependency of addresses of registers on the
secret key. In other words, addresses of registers are determined uniquely by the
secret key, because in Alg. 4, for example, T[0] ← T[0] if $d_i = 0$ and T[0] ← T[1]
if $d_i = 1$, so that if $d_i$ changes, then the registers will change, too.

In order to resist ADPA, previous countermeasures randomize the scalar
value [11]. However, the weakness lies in the direct correlation between the secret
key and addresses of registers. What we should hide is this relation rather than
the scalar value. So we randomize addresses of registers by a one-time random
number $r = r_{n-1}2^{n-1} + \cdots + r_1 2 + r_0$ ($r_i \in \{0, 1\}$). We change all parameters $d_i$
to $d_i \oplus r_i$, where $\oplus$ denotes the XOR operation. Then all addresses of registers
are randomized so that the side channel information will be randomized for each
scalar exponentiation. This is the basic idea of our proposed countermeasure, the
randomized addressing method (RA), against ADPA.

The main advantage of our method is the small overhead; the random number
is easily generated and the required additional operations are just XORs. However,
our countermeasure has no DDPA-resistance. We have to combine it with other
countermeasures to resist all side channel attacks. A DDPA-resistant scheme can be
converted to an ADPA-resistant scheme with almost no cost. Moreover, our
countermeasure can be easily applied to window methods.



3.2 Description of Algorithms 

Example algorithms of our countermeasure combined with the add-and-double-always
method, the Montgomery ladder, and a window method (with 4-bit
window) are in Alg. 7-9. All sample algorithms are resistant against SCA, namely
SPA, DDPA, and ADPA. Here $r = r_{n-1}2^{n-1} + \cdots + r_1 2 + r_0$ ($r_i \in \{0, 1\}$) is a random
number and r in Alg. 9 is a 4-bit random number. The functions R, invR denote
RPC, invRPC or RC, invRC described in section 2.2, respectively.



INPUT: d[], P
OUTPUT: dP
1: T[2] = R(P)
2: T[r[n-1]] = T[2]
3: for i=n-2 downto 0 {
4:   T[r[i+1]] = ECDBL(T[r[i+1]])
5:   T[1-r[i+1]] = ECADD(T[r[i+1]],T[2])
6:   T[r[i]] = T[d[i]⊕r[i+1]]
7: }
8: return invR(T[r[0]])

Alg. 7. Proposed countermeasure (add-and-double-always method from MSB)

INPUT: d[], P
OUTPUT: dP
1: T[r[n-1]] = R(P)
2: T[1-r[n-1]] = ECDBL(T[r[n-1]])
3: for i=n-2 downto 0 {
4:   T[2] = ECDBL(T[d[i]⊕r[i+1]])
5:   T[1] = ECADD(T[0],T[1])
6:   T[0] = T[2-(d[i]⊕r[i])]
7:   T[1] = T[1+(d[i]⊕r[i])]
8: }
9: return invR(T[r[0]])

Alg. 8. Proposed countermeasure (Montgomery ladder)



Note 1. In the above algorithms, a random number r can be computed on the 
fly for efficiency rather than generated and stored in memory at the beginning. 

INPUT: d[], P
OUTPUT: dP

 1: W[0] = 0, W[1⊕r] = P
 2: W[2⊕r] = ECDBL(W[1⊕r])
 3: for i=3 upto 15 {
 4:   W[i⊕r] = ECADD(W[(i-1)⊕r], W[1⊕r])
 5: }
 6: T = W[d[n-1,n-4]⊕r]
 7: for i=n-5 downto 0 step -4 {
 8:   T = ECDBL(T), T = ECDBL(T)
 9:   T = ECDBL(T), T = ECDBL(T)
10:   T = ECADD(T, W[d[i,i-3]⊕r])
11: }
12: return invR(T)

Alg. 9. Proposed countermeasure (4-bit window method) 

Note 2. A similar countermeasure based on randomizing registers was also proposed
by May et al. [27]. Their approach is to construct specialized hardware, while
ours works at the software level.

3.3 Security Analysis 

Let us discuss the security of our countermeasure. Since our scheme is designed
to be combined with other countermeasures to resist SCA as a whole, we only
consider its security against ADPA. In Alg. 7-9, the addresses of registers are
determined by d_i ⊕ r_i. ADPA can distinguish whether the two addresses
corresponding to d_i and d_j are the same or not. If these addresses are the same,
an adversary learns that d_i ⊕ r_i = d_j ⊕ r_j. But he/she cannot determine whether
d_i = d_j or not, because r_i, r_j are chosen randomly. Conversely, even if
d_i = d_j, the addresses are not always the same because of r_i and r_j. Thus the
addresses of registers are randomized and our countermeasure is secure against
ADPA.
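
The counting argument above can be checked by brute force. The Python sketch
below is an illustration only (the function name is hypothetical): for each
pair of key bits (d_i, d_j) it enumerates all mask bits (r_i, r_j) and counts
how often the two masked addresses coincide. If the count does not depend on
whether d_i = d_j, an address collision carries no information about the key
bits.

```python
from itertools import product


def collisions(d_i, d_j):
    """Number of mask pairs (r_i, r_j) for which the two addresses coincide."""
    return sum(
        (d_i ^ r_i) == (d_j ^ r_j)
        for r_i, r_j in product((0, 1), repeat=2)
    )


# Collision counts for all four combinations of key bits.
COLLISION_TABLE = {
    (d_i, d_j): collisions(d_i, d_j)
    for d_i, d_j in product((0, 1), repeat=2)
}
```

Every entry of `COLLISION_TABLE` is 2 out of 4 mask pairs, so equal and unequal
key bits collide equally often under uniformly random masks.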

3.4 Experimental Results 

We performed an experiment to verify the effect of our countermeasure on
security. We applied the address-bit DPA attack to an implementation of the
Montgomery ladder using RPC, with and without the register randomization.
In the experiment, the target processor was run at 10 MHz, the sampling rate
was set to 100 MHz, and we computed a differential power trace as

  (mean of 10000 traces when loading Q[addr[d_a]])
  - (mean of 10000 traces when loading Q[addr[d_b]])

for d_a ≠ d_b, where addr[d_x] (x = a or b) represents the address value determined
from d_x in each implementation. Fig. 1 shows a differential power trace without
register randomization, and Fig. 2 shows that with register randomization. In
Fig. 1, some spikes showing the evidence for d_a ≠ d_b are observed, but they are
not found in Fig. 2. Hence we confirmed the effect of our countermeasure in
protecting against the address-bit DPA attack.

Fig. 1. Differential power traces without register randomization

Fig. 2. Differential power traces with register randomization (proposed method)



4 Comparison 

In order to resist side channel attacks, numerous countermeasures have been
proposed. However, finding a good countermeasure that offers a certain security
level at an acceptable processing speed is a hard task for implementers, because
there has been no standard rule for comparing these countermeasures.
In this section, we provide a complete comparison of these countermeasures from
the viewpoints of security level, processing speed, and number of required
registers. Moreover, combined countermeasures can be evaluated by the comparison.
As a result, our proposed countermeasure can be combined with others to establish
a practical solution for resisting side channel attacks in the real world.

4.1 Evaluation Technique

We examine an exponentiation in 160-bit ECC in terms of security level, processing
speed, and number of required registers. In the following, the binary methods
and the Montgomery Ladder are called the base algorithms, to which countermeasures
are applied. For lack of space, we do not discuss window-based methods or
methods that are unavailable on generic elliptic curves.

Security Level. The security of a countermeasure against each of SPA, DDPA,
and ADPA is measured by the Attenuation Ratio (AR)^1 proposed by Itoh et al.
[15]. AR is given by the ratio of the heights of spikes with and without the
countermeasure. We denote the ratios against SPA, DDPA and ADPA by AR_S, AR_D
and AR_A, respectively. All ARs lie between 0 and 1, and a smaller AR is
better. If AR = 0, an adversary cannot observe spikes at any cost and the
method is secure; if AR = 1, the adversary can always observe spikes and the
method is totally insecure. AR_S takes the value 0 or 1, while AR_D and AR_A
take arbitrary values in the interval [0, 1].

Processing Speed. The processing speed of a scalar exponentiation dP is measured
by the number of ECADDs, denoted N_A, and the number of ECDBLs, denoted N_D.
For the base algorithms, N_A and N_D are given as integers. If a countermeasure
requires a times as many ECADDs as before, we denote this by "×a". We ignore the
cost of randomizations or transformations required in the countermeasure because
they are relatively small compared to N_A and N_D.

Register. The number of required registers is measured by the number of points
R_P and the number of scalars R_S in a scalar exponentiation. We assume all points
are represented in projective coordinates just for simplicity. For the base
algorithms, R_P and R_S are given as integers. For countermeasures, R_P and R_S
are evaluated as the extra points and scalars required in the exponentiation. If a
countermeasure requires b more points than before, R_P is denoted by "+b". We
ignore the temporary registers for ECADD and ECDBL.



4.2 Countermeasures 

In this section, we evaluate these values for each countermeasure. We only deal
with countermeasures whose recommended parameters are explicitly given in
the original papers, so we exclude RRR and some other countermeasures. We
do not deal with the randomized addition-subtraction chain [30] because it has
been shown to be insecure [33,35,39].

SPA Countermeasure. The add-and-double-always method (Alg. 1) computes
ECDBL and ECADD for every bit and is SPA-resistant (AR_S = 0). But the
data computed in step 5 is predictable (AR_D = 1) and the addresses in step 5
are correlated with d (AR_A = 1). The number of ECDBLs is unchanged (N_D = ×1)
but the number of ECADDs is doubled on average (N_A = ×2). Extra points and
scalars are not required (R_P = +0, R_S = +0). The same discussion applies to
the LSB case.

^1 AR was originally introduced to evaluate the security against DDPA; however,
it is easily applied to other attacks.

The Montgomery Ladder (Alg. 3) computes ECDBL and ECADD for every
bit and is SPA-resistant (AR_S = 0). But the data computed in steps 5 and 6 are
predictable (AR_D = 1) and the addresses of registers in steps 3, 5, 6 are
correlated with d (AR_A = 1). The required numbers of ECDBLs and ECADDs are
both 160 (N_D = N_A = 160). Required registers are R_P = 3, R_S = 1.

DPA Countermeasure by Data Randomization. In the Randomized Projective
Coordinate (RPC) [5], a point (X, Y, Z) is randomized to (rX, rY, rZ) by
a 160-bit random number r and it resists DDPA (AR_D = 2^{-160}). As it is
vulnerable to SPA (AR_S = 1), it must be used with the add-and-double-always
method or the Montgomery Ladder. It is also vulnerable to ADPA (AR_A = 1).
Processing speed is unchanged (N_A = N_D = ×1). An extra point for the randomized
point and an extra scalar for the random number are required (R_P = +1, R_S = +1).

In the Randomized Curve (RC) [16], a point (X, Y, Z) is randomized to
(r^2 X, r^3 Y, Z) on an isomorphic curve by a 160-bit random number r and it
resists DDPA (AR_D = 2^{-160}). But it is vulnerable to SPA and ADPA (AR_S =
AR_A = 1). Processing speed is unchanged (N_A = N_D = ×1). Extra scalars for r
and the coefficients of the isomorphic curve are required (R_P = +0, R_S = +3).

In the Randomized Base Point [5], a scalar exponentiation dP is computed
as d(P + R) - dR for a random point R. As R is chosen for each exponentiation,
the method is DDPA-resistant (AR_D = 2^{-160}). But it is vulnerable to SPA and
ADPA (AR_S = AR_A = 1). The countermeasure requires two exponentiations
and an extra ECADD (N_D = ×2, N_A = ×2 + 1). Extra registers for R and
P + R are required (R_P = +2, R_S = +0).

DPA Countermeasure by Scalar Randomization. In the Randomized Exponent
[5], a scalar d is randomized to d + rφ for a random number r and the
order φ of the base point P. In the original paper, the length of r is 20 bits
and, thus, the countermeasure is DDPA-resistant (AR_D = 2^{-20}). (There is an
analysis which claims that the 20-bit randomization is not sufficient [32]. Of
course, we can extend r to 160 bits; however, the processing speed then becomes
much slower.) It is also ADPA-resistant (AR_A = 0); however, it is vulnerable
to SPA (AR_S = 1). Processing becomes slower because the scalar is 20 bits
longer (N_D = N_A = ×180/160 = ×1.13). An extra register for r is required
(R_P = +0, R_S = +1).
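
The Randomized Exponent relies on the identity (d + rφ)P = dP, which holds
because φ is the order of P. The following Python sketch checks this on a toy
group. It is an illustration under stated assumptions: the elliptic curve group
is replaced by the additive group of integers modulo `order` (so every element
has order dividing `order`), and the function names are hypothetical.

```python
import secrets


def scalar_mul(k, p, order):
    """Left-to-right binary method in the additive group Z_order (toy stand-in)."""
    q = 0
    for bit in bin(k)[2:]:
        q = (2 * q) % order          # doubling step
        if bit == "1":
            q = (q + p) % order      # addition step
    return q


def randomized_exponent(d, p, order, r_bits=20):
    """Compute d*p using the randomized scalar d + r*order with a fresh r."""
    r = secrets.randbits(r_bits)
    return scalar_mul(d + r * order, p, order)
```

The randomized scalar is about `r_bits` bits longer than d, which is the source
of the ×180/160 slowdown quoted above for a 160-bit d and a 20-bit r.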

In the Randomized Start Point [25], a start bit is chosen from a 160-bit scalar
and an exponentiation is computed from the chosen bit, by MSB for the upper bits
and by LSB for the lower bits. However, the effect is rather small (AR_D = 1/160 =
2^{-7.3}). It is also ADPA-resistant (AR_A = 0); however, it is vulnerable to SPA
(AR_S = 1). It requires no extra processing (N_D = N_A = ×1) and no extra
registers (R_P = R_S = +0).

In the Exponent Splitting [6], a scalar d is divided into r and d - r for a 160-
bit random number r. As the secret information d is randomized, it is resistant
against DDPA and ADPA (AR_A = 0, AR_D = 2^{-160}). However, it is vulnerable
to SPA (AR_S = 1). The countermeasure requires two exponentiations and an
extra ECADD (N_D = ×2, N_A = ×2 + 1). Extra registers for (d - r)P, r, d are
required (R_P = +1, R_S = +2).

Table 1. Comparison of countermeasures (non-window methods)

No.       Method                                   AR_S  AR_D      AR_A  N_D    N_A    R_P  R_S
1         binary method (from MSB)                 1     1         1     160    80     1    0
2         binary method (from LSB)                 1     1         1     160    80     2    0
3         add-and-double-always method             0     1         1     ×1     ×2     +1   +0
4         Montgomery ladder                        0     1         1     160    160    3    0
5         Randomized Projective Coordinate (RPC)   1     2^{-160}  1     ×1     ×1     +1   +1
6         Randomized Curve (RC)                    1     2^{-160}  1     ×1     ×1     +0   +3
7         Randomized Base Point                    1     2^{-160}  1     ×2     ×2     +2   +0
8         Randomized Exponent (|r| = 20)           1     2^{-20}   0     ×1.13  ×1.13  +0   +1
9         Randomized Start Point                   1     2^{-7.3}  0     ×1     ×1     +0   +0
10        Exponent Splitting                       1     2^{-160}  0     ×2     ×2+1   +1   +2
11        Randomized Addressing (RA)               1     1         0     ×1     ×1     +0   +2
1+3+5+11  Best Combination                         0     2^{-160}  0     160    160    3    3
1+3+6+11  Best Combination                         0     2^{-160}  0     160    160    2    5
4+5+11    Other Solution                           0     2^{-160}  0     160    160    4    3
4+6+11    Other Solution                           0     2^{-160}  0     160    160    3    5

In the Randomized Addressing (RA) (Alg. 7), proposed in this paper, the
registers in steps 4, 5, 6 are determined by a random bit r_i. It is ADPA-resistant
(AR_A = 0); however, it is vulnerable to SPA and DDPA (AR_S = AR_D = 1).
It requires no extra processing (N_D = N_A = ×1). Extra registers for r and
r ⊕ d are required (R_P = +0, R_S = +2).



4.3 Combining Countermeasures 

As we saw in the previous section, almost all countermeasures only resist specific
attacks. Implementers should combine them to resist all side channel attacks. In
our setting, we can easily evaluate and compare combined countermeasures.
The security levels AR_S, AR_D, AR_A of a combination are evaluated as the
products of the individual values. Similarly, the processing speed is evaluated
by the product and the number of registers by the sum. For example, if we use
the Montgomery Ladder combined with RPC and RA, which is denoted as 4+5+11 in
Table 1, the security levels, processing speed and number of registers are given
by AR_S = 0, AR_D = 2^{-160}, AR_A = 0, N_D = 160, N_A = 160, R_P = 4, R_S = 3.
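
The combination rule (product for AR and for the operation counts, sum for the
registers) is mechanical, so it can be sketched in a few lines of Python. The
values below are transcribed from rows 1, 3, 4, 5, 6 and 11 of Table 1; the
data layout and the function name `combine` are our own illustrative choices,
not part of the paper.

```python
from fractions import Fraction
from math import prod

AR_160 = Fraction(1, 2**160)

# Base algorithms: absolute values (AR_S, AR_D, AR_A), (N_D, N_A), (R_P, R_S).
BASES = {
    1: {"ar": (1, 1, 1), "n": (160, 80), "reg": (1, 0)},    # binary, MSB
    4: {"ar": (0, 1, 1), "n": (160, 160), "reg": (3, 0)},   # Montgomery ladder
}

# Countermeasures: AR factors, operation-count multipliers, extra registers.
COUNTERMEASURES = {
    3: {"ar": (0, 1, 1), "n": (1, 2), "reg": (1, 0)},        # add-and-double-always
    5: {"ar": (1, AR_160, 1), "n": (1, 1), "reg": (1, 1)},   # RPC
    6: {"ar": (1, AR_160, 1), "n": (1, 1), "reg": (0, 3)},   # RC
    11: {"ar": (1, 1, 0), "n": (1, 1), "reg": (0, 2)},       # RA
}


def combine(base_no, cm_nos):
    """Evaluate a combination: products for AR and N, sums for R."""
    base = BASES[base_no]
    cms = [COUNTERMEASURES[k] for k in cm_nos]
    ar = tuple(base["ar"][j] * prod(c["ar"][j] for c in cms) for j in range(3))
    n = tuple(base["n"][j] * prod(c["n"][j] for c in cms) for j in range(2))
    reg = tuple(base["reg"][j] + sum(c["reg"][j] for c in cms) for j in range(2))
    return ar, n, reg
```

For instance, `combine(4, [5, 11])` evaluates the combination 4+5+11 and
`combine(1, [3, 5, 11])` evaluates 1+3+5+11; both can be compared against the
corresponding rows of Table 1.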

4.4 Comparison 

This section provides a complete comparison of countermeasures (Table 1). The
base algorithms are set in italic. As in the previous section, the security
level can be evaluated for combined countermeasures. With this table, we can
easily choose the most suitable countermeasure(s) in terms of security level,
processing speed and number of required registers.

From Table 1, we can conclude that 1+3+5+11 and 1+3+6+11 are the best
combinations in terms of security level and processing speed. Other
possible solutions are 4+5+11 and 4+6+11. In these combinations, more efficient
addition formulas for ECADD and ECDBL are applicable, and faster processing
and fewer registers can be expected [13]. All of these combinations use
our proposed countermeasure.

Note 3. A similar discussion can be established for window-based methods. We do
not compare them here for lack of space.

5 Concluding Remarks 

In this paper, a practical countermeasure against the address-bit DPA, the
randomized addressing method, was proposed. As its overhead is quite small, the
method improves security without slowing down the scalar exponentiation
algorithm. Although the approach of our proposal is similar to the previous work
by May et al. [27], the implementation methodology is quite different. Our
approach requires no special hardware and can be implemented on various
processors with very simple program code. Moreover, we presented concrete
security evaluation results, both theoretical and experimental.

In order to resist side channel attacks, considering countermeasures
against each attack is important for implementers. However, when
they are to establish total security, combinations of countermeasures are
more important. For this purpose, our comparison table (Table 1) will be a great
help.

Abstract. Elliptic curve cryptosystems (ECC) are well suited for
implementation in memory-constrained environments due to their small
key size. However, side channel attacks (SCA) can break the secret key
of ECC on such devices if the implementation method is not carefully
considered. The scalar multiplication of ECC is particularly vulnerable
to the SCA. In this paper we propose an SCA-resistant scalar
multiplication method that is allowed to take any number of pre-computed points.
The proposed scheme essentially intends to resist the simple power
analysis (SPA), not the differential power analysis (DPA). Therefore it is
different from the other schemes designed for resisting the DPA. The
previous SPA-countermeasures based on window methods utilize fixed
pattern windows, so that they only take discrete table sizes. The optimal
size is 2^{w-1} for w = 2, 3, ..., which was proposed by Okeya and Takagi.
We take a different approach from them. The key idea is to randomly
(but with fixed probability) generate two different patterns based
on pre-computed points. The two distributions are indistinguishable
from the viewpoint of the SPA. The proposed probabilistic scheme
provides us more flexibility for generating the pre-computed points: the
designer of smart cards can freely choose the table size without restraint.

Keywords: Elliptic Curve Cryptosystem, Side Channel Attacks, Width-w
NAF, Fractional Window, Pre-computation Table, Smart Card, Memory
Constraint



1 Introduction 

We are standing at the beginning of the ubiquitous computing era. It is
expected that we can accomplish lucrative applications by effectively combining
ubiquitous computers with cryptography. A ubiquitous computer has only a
scarce computational environment, so we have to make an effort to optimize
the memory and efficiency of the cryptosystem. Elliptic curve cryptosystems
(ECC) are suitable for this purpose because of their short key size [Kob87,Mil86].
However, several experimental tests show that side channel attacks (SCA) can
break ECC if the implementation on the devices is not carefully considered
[Cor99,HaM02,IIT02].

The SCA tries to find a correlation between the side channel information and
the operation related to the secret key. In this paper we discuss the SCA on ECC
using the power analysis [KJJ99], which consists of the simple power analysis
(SPA) and the differential power analysis (DPA). The SPA simply observes
several power consumptions of the device, and the DPA is additionally allowed to
use a statistical tool in order to guess the secret information. An SPA-resistant
scheme can be converted to a DPA-resistant one by randomizing the parameters
of the underlying system (see for example [Cor99,JT01]). There are three
different types of SPA-resistant scheme: (1) an indistinguishable addition formula
that uses one formula for both elliptic addition and doubling [LS01,JQ01,
BJ02]; (2) an addition chain that always computes elliptic addition and doubling
for each bit [Cor99,OS00,FGKS02,IT02,BJ02]; (3) a window-based addition chain
with a fixed pattern [Möl01a,Möl01b,Möl02a,OT03]. In this paper we deal with
the third category. The optimal one in (3) is the scheme proposed by Okeya and
Takagi [OT03].

We intend to propose an SPA-resistant scalar multiplication that allows us to
choose any number of pre-computed points. We try to reduce the table size
of the Okeya-Takagi scheme using the fractional window method proposed by
Möller [Möl02b]. The fractional window method reduces a part of the pre-computed
points to a smaller window size. Therefore, the table length is not fixed anymore,
and the corresponding addition chain has no fixed pattern. It is not obvious how
to construct an SPA-resistant scheme using the fractional window method. In order
to overcome this bias we propose a novel approach. We generate the points with
smaller window size by a probabilistic process, such that they are
indistinguishable from the viewpoint of the SPA. Indeed, all points in the table
are classified into (lower) points (i.e. uP, u < 2^{w-1}) and (upper) points
(i.e. uP, u > 2^{w-1}), where w and P are the underlying width and the base point,
respectively. We control the reduction probability of (lower) based on that of
(upper), so that the distributions of (lower) and (upper) are indistinguishable
against SPA. The pre-computed points for (upper) are randomly chosen for every
scalar multiplication, and the points in class (lower) are randomly reduced with
the above reduction probability. Thus the SPA cannot detect which point is used
in each class (upper) and (lower).

In order to implement highly functional applications on memory-constrained
devices such as smartcards, the cryptographic functions are usually required to
be efficient and to use little memory. In addition, applications are often
added to (or deleted from) the smartcards, so the memory space available to
the cryptographic functions depends on the individual situation. Hence
the cryptographic schemes should be optimized for such individual situations.
The proposed scheme attains SPA-resistance with any size of the pre-computed
table. The designer of ECC can flexibly choose the table size suitable
for the smartcards.

This paper is organized as follows: In Section 2 we review the scalar
multiplication of elliptic curves. The width-w NAF and the fractional window
method are reviewed. In Section 3 the side channel attacks are discussed. The
fast and memory efficient countermeasures are presented. In Section 4 we show the
proposed scheme. The security and efficiency are discussed. In Section 5 we
conclude the paper.



2 Scalar Multiplication of ECC 

In this section we review the scalar multiplication of elliptic curve
cryptosystems (ECC). The width-w non-adjacent form (NAF) and the fractional window
method are discussed.

The scalar multiplication computes dP for a point P on the elliptic curve
and a scalar d. Many algorithms for computing the scalar multiplication
have been proposed. Because the inverse of a point P can be computed with
little additional cost, the signed representation of d is usually deployed. The
fastest method with less memory is the width-w non-adjacent form (NAF). The
width-w NAF represents an n-bit integer as d = Σ_{i=0}^{n} d_w[i] 2^i, where each
non-zero d_w[i] is an odd integer with |d_w[i]| < 2^{w-1} and there is at most one
non-zero digit among any w consecutive digits. Therefore, we pre-compute the
table with points P, 3P, ..., (2^{w-1} - 1)P, which has 2^{w-2} points including
the base point P. The points with the opposite sign are generated on the fly
during the scalar multiplication.



Generating_Width-w_NAF

INPUT: An n-bit d, a width w
OUTPUT: d_w[n], d_w[n-1], ..., d_w[0]

1. i ← 0
2. While d > 0 do the following
   2.1. if d is odd then do the following
        2.1.1. d_w[i] ← d mods 2^w
        2.1.2. d ← d - d_w[i]
   2.2. else d_w[i] ← 0
   2.3. d ← d/2, i ← i + 1
3. Return d_w[n], d_w[n-1], ..., d_w[0]

Scalar_Multiplication_with_Width-w_NAF

INPUT: d_w[i], P, (|d_w[i]|)P
OUTPUT: dP

1. Q ← d_w[c]P for the largest c with d_w[c] ≠ 0
2. For i = c - 1 to 0
   2.1. Q ← ECDBL(Q)
   2.2. if d_w[i] ≠ 0 then Q ← ECADD(Q, d_w[i]P)
3. Return Q



Several methods for generating the width-w NAF have been proposed [KT92],
[MOC97], [BSS99], [Sol00]. Generating_Width-w_NAF is an algorithm that
generates the width-w NAF proposed by Solinas [Sol00]. The notation "mods 2^w" at
Step 2.1.1 stands for the signed residue modulo 2^w, namely ±1, ±3, ...,
±(2^{w-1} - 1). Note that the next (w - 1) consecutive bits after a non-zero bit
in the width-w NAF are always zero. It is known that the density of the non-zero
bits of the width-w NAF is asymptotically equal to 1/(1 + w).

Scalar_Multiplication_with_Width-w_NAF is an algorithm for computing the
scalar multiplication using the width-w NAF. It proceeds from the most
significant bit: elliptic curve doubling (ECDBL) at Step 2.1 is executed for
each bit and elliptic curve addition (ECADD) at Step 2.2 is executed if and only
if d_w[i] is non-zero. Therefore we have to compute (c + 1) ECDBLs and
(c + 1)/(1 + w) ECADDs, where c is the largest integer with d_w[c] ≠ 0.
If we choose a larger width w, then the scalar multiplication becomes faster, but
requires more memory.
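
The two routines reviewed in this section can be sketched in Python as follows.
This is a minimal illustration under stated assumptions: the elliptic curve
group is replaced by the additive group of integers modulo `order`, so that the
result can be compared with d·p directly, and the function names are
hypothetical.

```python
def mods(d, w):
    """Signed residue of d modulo 2**w, in the range [-2**(w-1), 2**(w-1))."""
    u = d % (1 << w)
    return u - (1 << w) if u >= (1 << (w - 1)) else u


def width_w_naf(d, w):
    """Generating_Width-w_NAF: returns digits with digits[i] = d_w[i]."""
    digits = []
    while d > 0:
        if d & 1:
            digit = mods(d, w)   # step 2.1.1
            d -= digit           # step 2.1.2
        else:
            digit = 0            # step 2.2
        digits.append(digit)
        d >>= 1                  # step 2.3
    return digits


def scalar_mul_naf(digits, p, order):
    """Scalar_Multiplication_with_Width-w_NAF over the toy group Z_order."""
    c = max(i for i, x in enumerate(digits) if x != 0)
    q = (digits[c] * p) % order            # step 1
    for i in range(c - 1, -1, -1):         # step 2
        q = (2 * q) % order                # step 2.1 (ECDBL stand-in)
        if digits[i] != 0:
            q = (q + digits[i] * p) % order  # step 2.2 (ECADD stand-in)
    return q
```

In a real implementation the products `digits[i] * p` would be looked up in the
pre-computed table of odd multiples, with negative digits handled by point
negation.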



2.1 Fractional Width-w NAF

The width-w NAF uses the table P, 3P, ..., (2^{w-1} - 1)P. The size of the table
takes discrete values 1, 2, 4, 8, ... for w = 2, 3, 4, ... The density of non-zero
bits of the width-w NAF also takes the discrete values 1/(w + 1). In order to
interpolate their intermediate values, Möller discussed how to construct the NAF
with fractional widths [Möl02b]. His idea is to utilize the degenerated width-w
NAF: some table values of the width-w NAF are not pre-computed^1, where w ≥ 3.
We call it the fractional width-w NAF in this paper.

The fractional width-w NAF can be easily generated by modifying
Generating_Width-w_NAF. Indeed we insert the following step between Step 2.1.1 and
Step 2.1.2:

  if |d_w[i]| > 2^{w-2} + B then d_w[i] ← d_w[i] mods 2^{w-1},

where B is an integer 0 ≤ B ≤ 2^{w-2} that determines the table size and the
efficiency between the width-w and width-(w - 1) NAF. If we choose B = 0 or
B = 2^{w-2}, then it becomes the width-(w - 1) or width-w NAF, respectively.

We define the width w suitable for our paper in the following. Let w =
(w_0 - 1) + w_1, where w_0 - 1 and w_1 are the integral and fractional parts^2 of
w, respectively; w_0 = ⌈w⌉, w_1 = w - (w_0 - 1). The fractional part w_1 takes one
of 1/2^{w_0-2}, 2/2^{w_0-2}, ..., 2^{w_0-2}/2^{w_0-2}. Here the pre-computed
points are P, 3P, ..., (2^{w_0-1} - 1)P, (2^{w_0-1} + 1)P, ...,
(2^{w_0-1} + w_1 2^{w_0-1} - 1)P. There are (1 + w_1)2^{w_0-2} points. The
non-zero density of the fractional width-w NAF is 1/(1 + w). The scalar
multiplication using the fractional width-w NAF is computed in the same way as
Scalar_Multiplication_with_Width-w_NAF.

^1 Strictly speaking, Möller's idea is as follows: some values of the width-(w + 1)
NAF are appended to the table. This enhances the speed but requires additional
memory. On the contrary, the degenerated width-w NAF saves memory
but reduces the speed. In other words, speed and memory have a trade-off relation.
The expression in this paper is different from that in [Möl02b]; however, they are
equivalent in this respect. We use the former for the sake of the description of
the proposed scheme in the following sections.

^2 We may define w = w_0 + w_1, w_0 = ⌊w⌋, w_1 = w - w_0. For the sake of
simplicity and ease of comparison between the original and proposed schemes in the
following sections, we use the notation of the former.

3 Side Channel Attacks and Their Countermeasures

In this section we review side channel attacks and their countermeasures.

Side channel attacks (SCA) are allowed to access additional information
linked to the operations using the secret key, e.g., timings, power consumptions,
etc. The attack aims at guessing the secret key (or some related information).
Scalar_Multiplication_with_Width-w_NAF can be broken by the SCA. It calculates
the ECADD if and only if the i-th bit is not zero. The standard implementation
of ECADD is different from that of ECDBL, and thus the ECADD in the scalar
multiplication can be detected using SCA.

If the attacker is allowed to observe the side channel information only a few
times, it is called the simple power analysis (SPA). If the attacker can analyze
several pieces of side channel information using a statistical tool, it is called
the differential power analysis (DPA). The standard DPA utilizes a correlation
function that can distinguish whether a specific bit is related to the observed
calculation. In order to resist DPA, we need to randomize the parameters of
elliptic curves.

There are three standard randomizations [Cor99,JT01]: (1) the base point is
masked by a random point; (2) the secret scalar is randomized with a multiple of
the order of the curve; (3) the base point is randomized in projective
coordinates (or Jacobian coordinates). Some attacks or weak classes against each
countermeasure have been proposed [Gou03,OS00]. However, if these randomization
methods are used simultaneously, no attack is known to break the combined
scheme. In other words, SPA-resistant schemes can be easily converted to
DPA-resistant ones using these randomizations.

On the contrary, there are some schemes which try to achieve SPA- and
DPA-resistance simultaneously without using the combinations, e.g. randomized
window methods [Wal02a,IYTT02,LS01,OA01], etc. The security of these
schemes is controversial: some of them have been broken [OS02a,
Wal02b,Wal03a] or are less secure than expected [Wal03b]. Therefore we are
interested in the SPA-resistant schemes.



3.1 SPA-Resistant Methods 

We review the SPA-resistant schemes of computing the scalar multiplication. 

There are three different approaches to resist the SPA. We explain these
schemes in the following. (1) We construct the indistinguishable addition
formula [LS01,JQ01,BJ02]. (2) We use the addition formula that always computes
ECADD and ECDBL for each bit [Cor99,OS00,FGKS02,IT02,BJ02]. (3) We
generate the addition chain with fixed pattern [Möl01a,Möl01b,Möl02a,OT03].

(1) Whereas the indistinguishable addition formula conceals whether an operation
is an addition or a doubling, the attacker can detect the number of additions and
doublings in the computation. In other words, the indistinguishable addition
formula reduces the SPA to a timing attack. Hence, this type is imperfect.
(2) Since the second type does not use pre-computed points, the memory
consumption is small. In addition, if we are allowed to use a special form such
as the Montgomery form, this type is the fastest. However, some international
standards [ANSI, IEEE, NIST, SEC] do not support such a form. Without using
the special form, this type is not so fast because it requires many ECADD
operations. (3) The third type utilizes pre-computed points for speeding up the
computation, since the pre-computed points reduce the number of ECADD
operations. Whereas a large number of pre-computed points achieves a




402 K. Okeya and T. Takagi 



fast computation, it requires large memory for storing the points. Okeya and
Takagi proposed an SPA-resistant addition chain with small memory, which is
based on the width-w NAF [OT03]. The algorithm is as follows (we modify it
to suit our proposed scheme):

SPA-resistant_Width-w_NAF_with_Odd_Scalar

INPUT: An odd n-bit d
OUTPUT: d_w[n], d_w[n-1], ..., d_w[0]

1. r ← 0, i ← 0, r_0 ← w
2. While d > 1 do the following
   2.1. u[i] ← (d mod 2^{w+1}) - 2^w
   2.2. d ← (d - u[i])/2^{r_i}
   2.3. d_w[r + r_i - 1] ← 0, d_w[r + r_i - 2] ← 0, ..., d_w[r + 1] ← 0, d_w[r] ← u[i]
   2.4. r ← r + r_i, i ← i + 1, r_i ← w
3. d_w[n] ← 0, ..., d_w[r + 1] ← 0, d_w[r] ← 1
4. Return d_w[n], d_w[n-1], ..., d_w[0]



The algorithm generates the SPA-resistant chain only for an odd scalar; the
treatment of an even scalar was discussed in [OT03]. We assume that the scalar
d is odd in the following. At Step 2.1, the integer u[i] is assigned as
(d mod 2^{w+1}) - 2^w. The computation assures that u[i] is odd whenever d is odd.
Since d - u[i] = d - (d mod 2^{w+1}) + 2^w ≡ 2^w mod 2^{w+1}, the resultant
(d - u[i])/2^w is odd. Thus, each integer u[i] is odd. Note that d terminates
with d = 1. Hence we can achieve the SPA-resistant chain, e.g., the fixed pattern

  |0...0 x|0...0 x| ... |0...0 x|   with odd integers |x| < 2^w,

where each run 0...0 consists of w - 1 zeros.

The number of the pre-computed points is 2^{w-1}, and the density of the non-zero
bit is 1/w. The scalar multiplication using this chain is computed in the same
way as Scalar_Multiplication_with_Width-w_NAF.
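
The recoding above can be sketched in Python to make the fixed pattern visible.
This is an illustrative sketch, not the authors' implementation: the function
name is hypothetical, r_i is fixed to w as in the algorithm, and the leading
zero padding of Step 3 (d_w[n], ..., d_w[r + 1]) is omitted.

```python
def spa_resistant_naf(d, w):
    """Recode an odd scalar d; digits[j] = d_w[j], least significant first."""
    if d <= 0 or d % 2 == 0:
        raise ValueError("d must be a positive odd integer")
    digits = []
    while d > 1:
        u = (d % (1 << (w + 1))) - (1 << w)   # step 2.1: odd, |u| < 2**w
        d = (d - u) >> w                      # step 2.2: exact, result is odd
        digits.append(u)                      # step 2.3: d_w[r] = u[i] ...
        digits.extend([0] * (w - 1))          # ... followed by w - 1 zeros
    digits.append(1)                          # step 3: d_w[r] = 1
    return digits
```

Every digit at an index that is a multiple of w is a non-zero odd integer and
all other digits are zero, independently of d, which is the fixed pattern the
SPA-resistance argument relies on.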

Note that this scheme is optimal with respect to memory, and the table size
takes 1, 2, 4, 8, ... for w = 1, 2, 3, 4, .... If the designer of smart cards is
allowed to use the table sizes 1, 2, 4, 8, ..., this scheme is one of the best
solutions. However, if he is allowed to use just the sizes 3, 5, 6, ..., not
1, 2, 4, 8, ..., it compromises the memory and/or the speed. This situation often
occurs because some restrictions on resources such as memory and cost are
determined by the applications of the smart cards, not the specifications of the
cryptographic schemes. Such restrictions demand flexibility of the cryptographic
schemes. Hence, we need to construct such a scheme.

4 Proposed Scheme 

In this section we propose a new SPA-resistant scheme with any table size. After 
describing its main idea, we present the details of our algorithm. We then discuss 
the security, the efficiency, and the memory requirement of our proposed scheme. 

4.1 Main Idea

We describe the main idea of our proposed algorithm. The proposed scheme is
converted from SPA-resistant_Width-w_NAF_with_Odd_Scalar using the idea of the
fractional window method.

First, we discuss the security of the straightforward combination of the
Okeya-Takagi scheme [OT03] and the fractional window method [Möl02b],
and find that this combined scheme is not secure against SPA. For the sake of
simplicity we explain it with w = 4. SPA-resistant_Width-w_NAF_with_Odd_Scalar
with w = 4 pre-computes the signed odd integers modulo 2^w, i.e., U_w =
{±1, ±3, ±5, ±7, ±9, ±11, ±13, ±15}. The fractional width-w NAF can reduce it
to a smaller one, for instance, F = {±1, ±3, ±5, ±7, ±9}. Note that it still
contains the representations of the smaller modulus 2^{w-1}, i.e., U_{w-1} =
{±1, ±3, ±5, ±7}. The fractional window using class F is constructed by inserting
the following step between Step 2.1 and 2.2:

  if |u[i]| > 2^{w-1} + B, then u[i] ← (u[i] mod 2^w) - 2^{w-1}, r_i ← w - 1,

where B is an integer 0 ≤ B ≤ 2^{w-1} (in the case of F we choose B = 1). However,
the sequence d_w[n], d_w[n-1], ..., d_w[0] generated by this fractional window
method has no fixed pattern, so that it is not secure against the SPA. Indeed, we
know |u[i]| > 2^{w-1} + B if and only if (w - 2)-consecutive zeros (i.e.
r_i = w - 1) appear. In order to overcome this bias, we propose two novel ideas
in the following.

The first one is to control the choice between the two moduli $2^w$ and
$2^{w-1}$ as a probabilistic process. We reduce $u[i]$ with uniform probability
from the viewpoint of SPA. Since $u[i]$ with $|u[i]| < 2^{w-1}$ can be
represented with either modulus $2^w$ or $2^{w-1}$, the following trick achieves
our aim:

    if $|u[i]| < 2^{w-1}$, then $u[i] \leftarrow (u[i] \bmod 2^w) - 2^{w-1}$, $r_i \leftarrow w - 1$ with probability $1 - P_w$,

where $P_w$ is the probability that $u[i]$ within $U_{w-1}$ remains the
representative modulo $2^w$, and we should select
$P_w = (\#F - \#U_{w-1})/(\#U_w - \#U_{w-1})$. This means that we reduce $u[i]$
to the representative modulo $2^{w-1}$ with the same probability for both
$|u[i]| < 2^{w-1}$ and $|u[i]| > 2^{w-1}$. Thus SPA cannot distinguish the two
distributions.

The second idea is to use a different representation of the residue classes
modulo $2^w$. Using a different representation conceals whether a specific
$u[i]$ belongs to the class $F$. Instead of the “integer” $B$, we use a “subset”
$B$ of $U_w \setminus U_{w-1}$. The class $F$ is then chosen as
$F = U_{w-1} \cup B$, since $F$ must contain every odd signed residue modulo
$2^{w-1}$. Because $\#F = 10$, we should choose $\#B = 2$. Thus $B$ is randomly
chosen as one of $\{\pm 9\}$, $\{\pm 11\}$, $\{\pm 13\}$, or $\{\pm 15\}$. The
attacker cannot guess the value of $B$ because of this random choice.

4.2 Proposed Algorithm 

We now present the proposed algorithm. It generates an SPA-resistant fractional
width-w NAF for a given n-bit odd scalar d and width w. The algorithm is as
follows:
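
SPA-resistant_Fractional_Width-w_NAF_with_Odd_Scalar
INPUT An odd n-bit scalar $d$, and a width $w$
OUTPUT $d_w[n + w_0 - 1], \ldots, d_w[n - 1], \ldots, d_w[0]$, and $B = \{\pm b_1, \ldots, \pm b_{w_1 2^{w_0-2}}\}$

1. $w_0 \leftarrow \lceil w \rceil$, $w_1 \leftarrow w - (w_0 - 1)$

2. Randomly choose distinct integers
   $b_1, \ldots, b_{w_1 2^{w_0-2}} \in U^+_{w_0} \setminus U^+_{w_0-1} = \{2^{w_0-1} + 1, 2^{w_0-1} + 3, \ldots, 2^{w_0} - 1\}$,
   and put $B = \{\pm b_1, \ldots, \pm b_{w_1 2^{w_0-2}}\}$, $P_w \leftarrow w_1$,
   where $U^+_v = \{1, 3, 5, \ldots, 2^v - 1\}$ for a positive integer $v$.

3. $r \leftarrow 0$, $i \leftarrow 0$

4. While $d > 1$ do the following
   4.1. $x[i] \leftarrow (d \bmod 2^{w_0+1}) - 2^{w_0}$, $y[i] \leftarrow (d \bmod 2^{w_0}) - 2^{w_0-1}$
   4.2. if $|x[i]| < 2^{w_0-1}$ then
            $r_i \leftarrow w_0$, $u[i] \leftarrow x[i]$ with probability $P_w$;
            $r_i \leftarrow w_0 - 1$, $u[i] \leftarrow y[i]$ with probability $1 - P_w$
        else if $x[i] \in B$ then
            $r_i \leftarrow w_0$, $u[i] \leftarrow x[i]$
        else
            $r_i \leftarrow w_0 - 1$, $u[i] \leftarrow y[i]$
   4.3. $d \leftarrow (d - u[i])/2^{r_i}$
   4.4. $d_w[r + r_i - 1] \leftarrow 0$, $d_w[r + r_i - 2] \leftarrow 0$, ..., $d_w[r + 1] \leftarrow 0$, $d_w[r] \leftarrow u[i]$
   4.5. $r \leftarrow r + r_i$, $i \leftarrow i + 1$

5. $d_w[n + w_0 - 1] \leftarrow 0$, ..., $d_w[r + 1] \leftarrow 0$, $d_w[r] \leftarrow 1$

6. Return $d_w[n + w_0 - 1], \ldots, d_w[n - 1], \ldots, d_w[0]$, and $B = \{\pm b_1, \ldots, \pm b_{w_1 2^{w_0-2}}\}$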
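The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name, the returned list of window lengths `runs`, and the use of Python's `random` module in place of the card's random number generator are assumptions made here, and the final zero padding up to position $n + w_0 - 1$ in Step 5 is omitted. It also assumes $P_w = w_1$, which follows from $P_w = \#B/(\#U_{w_0} - \#U_{w_0-1})$.

```python
import math
import random
from fractions import Fraction


def spa_fractional_recode(d, w, rng=None):
    """Recode an odd scalar d; returns (digits, B, runs), digits least significant first."""
    rng = rng or random.SystemRandom()
    if d < 1 or d % 2 == 0:
        raise ValueError("d must be a positive odd integer")
    w = Fraction(w)
    w0 = math.ceil(w)                                    # Step 1
    w1 = w - (w0 - 1)
    if w0 < 2:
        raise ValueError("this sketch assumes w >= 2")
    count = w1 * 2 ** (w0 - 2)                           # number of indices b_j
    if count.denominator != 1:
        raise ValueError("w1 * 2^(w0-2) must be an integer")
    a = int(count)                                       # P_w = a / 2^(w0-2)
    upper = list(range(2 ** (w0 - 1) + 1, 2 ** w0, 2))   # U+_{w0} \ U+_{w0-1}
    b_pos = rng.sample(upper, a)                         # Step 2
    B = set(b_pos) | {-b for b in b_pos}
    digits, runs = [], []                                # Step 3: r == len(digits)
    while d > 1:                                         # Step 4
        x = d % 2 ** (w0 + 1) - 2 ** w0                  # Step 4.1
        y = d % 2 ** w0 - 2 ** (w0 - 1)
        if abs(x) < 2 ** (w0 - 1):                       # Step 4.2
            keep_x = rng.randrange(2 ** (w0 - 2)) < a    # true with probability P_w
        else:
            keep_x = x in B
        u, r_i = (x, w0) if keep_x else (y, w0 - 1)
        d = (d - u) >> r_i                               # Step 4.3, division is exact
        digits.extend([u] + [0] * (r_i - 1))             # Steps 4.4 and 4.5
        runs.append(r_i)
    digits.append(1)                                     # Step 5: leading digit
    return digits, B, runs
```

Summing `digits[j] * 2**j` over all positions recovers the original scalar, and every entry of `runs` is either $w_0$ or $w_0 - 1$, which is the only quantity an SPA observer sees.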

At Step 1 we assign the integral part $w_0$ and the fractional part $w_1$ of the
width $w$. At Step 2 the pre-computed indices $b_1, \ldots, b_{w_1 2^{w_0-2}}$,
which belong to the upper set $U^+_{w_0} \setminus U^+_{w_0-1}$, are randomly
chosen. The random signed index set $B$ is returned as part of the output. The
reduction probability $P_w$ is also assigned. At Step 3 the integers $r$, $i$
are initialized. Step 4 is the main loop of the proposed algorithm. At Step 4.1
we generate two different residue values, $x[i]$ modulo $2^{w_0}$ and $y[i]$
modulo $2^{w_0-1}$. At Step 4.2 one of $x[i]$ and $y[i]$ is assigned to $u[i]$
based on both the size $|x[i]|$ and the probability $P_w$. At Step 4.3 we
eliminate the lowest $r_i$ bits of $d$. At Step 4.4 the digits $d_w[\cdot]$ are
assigned. The $(r_i - 1)$ consecutive digits above the lowest digit $d_w[r]$ are
zero. At Step 4.5 the integers $r$, $i$ are updated. Finally we return all
digits of the proposed addition chain. The total length can be at most $w_0$
bits larger than the original $n$ bits.

The pre-computed points are calculated using not only a base point $P$ and a
width $w$ but also the randomized index $B = \{b_1, \ldots, b_{w_1 2^{w_0-2}}\}$.
For index $B$ the pre-computed points are $P, 3P, \ldots, (2^{w_0-1} - 1)P$ and
$b_1 P, \ldots, b_{w_1 2^{w_0-2}} P$. The scalar multiplication using the
proposed chain is computed in the same way as in
Scalar_Multiplication_with_Width-w_NAF.

At Step 4.2, $u[i] = x[i]$ is assigned with probability $P_w = a/2^{w_0-2}$ for
$a = 1, 2, \ldots, 2^{w_0-2}$. We can easily realize this probability using a
1-bit random number generator as follows: first we obtain a random
$(w_0 - 2)$-bit number rand by executing the 1-bit random number generator
$w_0 - 2$ times. Then we assign $u[i] = x[i]$ if and only if rand $< a$ holds.
The probability of rand $< a$ is exactly $a/2^{w_0-2}$ due to the uniform
distribution of rand in $\{0, 1, \ldots, 2^{w_0-2} - 1\}$. A 1-bit random number
generator is usually available on smart cards, so we can generate the
probability $P_w$ at a small additional cost.
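The comparison trick can be checked exhaustively. The following Python sketch is illustrative only: `choose_x`, `exact_probability` and the `bit_source` callback are hypothetical names standing in for the card's 1-bit generator. It enumerates every possible $(w_0 - 2)$-bit output and counts how often rand $< a$ holds.

```python
def choose_x(a, w0, bit_source):
    """Return True with probability a / 2^(w0-2), using w0-2 calls to a 1-bit RNG."""
    rand = 0
    for _ in range(w0 - 2):
        rand = (rand << 1) | (bit_source() & 1)
    return rand < a


def exact_probability(a, w0):
    """Feed every (w0-2)-bit string to choose_x; return (hits, number of strings)."""
    n_bits = w0 - 2
    hits = 0
    for value in range(2 ** n_bits):
        bits = iter([(value >> k) & 1 for k in reversed(range(n_bits))])
        hits += choose_x(a, w0, lambda: next(bits))
    return hits, 2 ** n_bits
```

On a real device `bit_source` would be the hardware generator; in a desktop experiment one could pass `lambda: random.getrandbits(1)` after importing `random`.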







4.3 Security against SPA 

We discuss the security of the proposed scheme against SPA. We prove that the
sequence $d_w[n], d_w[n-1], \ldots, d_w[0]$ produced by the proposed algorithm
has no correlation with the secret bit information in the sense of SPA.



Theorem 1. The proposed scheme is secure against the SPA. 



Proof. By the construction of the proposed algorithm, $u[i]$ is a non-zero odd
integer. Thus, any run of consecutive zero digits in the sequence
$d_w[n], d_w[n-1], \ldots, d_w[0]$ has length $w_0 - 1$ or $w_0 - 2$:

$$\ldots 0 \; u[i+1] \; \underbrace{0 \cdots 0}_{r_i - 1} \; u[i] \; 0 0 \ldots, \qquad r_i = w_0 \text{ or } w_0 - 1.$$

The corresponding AD sequence is

$$\ldots D D A \; \underbrace{D \cdots D}_{r_i} \; A \; D D \ldots, \qquad r_i = w_0 \text{ or } w_0 - 1,$$

where A and D indicate ECADD and ECDBL, respectively. Hence, all the
information that SPA can obtain from the AD sequence is the length of the runs
of consecutive zeros, namely $r_i$.

In the following we prove that $r_i$ provides no information about the secret
scalar $d$. Indeed, we show that the two AD sequences
$\underbrace{D \cdots D}_{w_0} A$ and $\underbrace{D \cdots D}_{w_0 - 1} A$ are
distributed independently of the secret scalar $d$. Here we can assume that $d$
is randomly and uniformly distributed over all $n$-bit odd integers, because $d$
is the secret key. Since $x[i]$ and $y[i]$ depend only on the lower $(w_0 + 1)$
bits of $d$, they are random $w_0$-bit and $(w_0 - 1)$-bit signed odd integers,
respectively, due to the uniform distribution of $d$. Thus, we consider the
lower $(w_0 + 1)$ bits of the binary representation of $d$. Note that the lowest
bit is always 1, because the preceding update of $d$ always leaves it odd. Thus,
we do not need to consider the effect of the lowest bit.

We estimate the probability that $x[i]$ or $y[i]$ is assigned at Step 4.2 of the
proposed algorithm. We have the following four cases:
$\mathrm{LSB}_{w_0+1}(d) = 00 * \cdots * 1$, $01 * \cdots * 1$,
$10 * \cdots * 1$, and $11 * \cdots * 1$, where $\mathrm{LSB}_{w_0+1}(d)$
denotes the lower $(w_0 + 1)$ bits of $d$. First we discuss the case
$\mathrm{LSB}_{w_0+1}(d) = 00 * \cdots * 1$. In this case we have
$-2^{w_0} < x[i] < -2^{w_0-1}$, that is, $|x[i]| > 2^{w_0-1}$. At Step 4.2, the
lower half of the instructions is executed. Since the probability of
$x[i] \in B$ is $\#B/(\#U_{w_0} - \#U_{w_0-1}) = P_w$, we have $r_i = w_0$ with
probability $P_w$, and $r_i = w_0 - 1$ with probability $1 - P_w$. Next, we
discuss the case $\mathrm{LSB}_{w_0+1}(d) = 01 * \cdots * 1$. At Step 4.2, the
upper half of the instructions is executed, since $-2^{w_0-1} < x[i] < 0$. Thus
we have $r_i = w_0$ with probability $P_w$, and $r_i = w_0 - 1$ with probability
$1 - P_w$. In the case $\mathrm{LSB}_{w_0+1}(d) = 10 * \cdots * 1$, the upper
half of the instructions at Step 4.2 is executed, since
$0 < x[i] < 2^{w_0-1}$. Thus we have $r_i = w_0$ with probability $P_w$, and
$r_i = w_0 - 1$ with probability $1 - P_w$. Finally, in the case
$\mathrm{LSB}_{w_0+1}(d) = 11 * \cdots * 1$, the lower half of the instructions
at Step 4.2 is executed, since $2^{w_0-1} < x[i] < 2^{w_0}$. Because the
probability that $x[i] \in B$ is $P_w$, we have $r_i = w_0$ with probability
$P_w$, and $r_i = w_0 - 1$ with probability $1 - P_w$. Therefore, the proposed
scheme produces $r_i = w_0$ with probability $P_w$ and $r_i = w_0 - 1$ with
probability $1 - P_w$, which is independent of $d$. □
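The four cases can be condensed with the law of total probability. As a sketch, write $c$ for the two leading bits of $\mathrm{LSB}_{w_0+1}(d)$ (a symbol introduced here only for illustration), and assume as in the proof that $d$ is uniform, so that each value of $c$ occurs with probability $1/4$:

$$\Pr[r_i = w_0] = \sum_{c \in \{00, 01, 10, 11\}} \Pr[c] \cdot \Pr[r_i = w_0 \mid c] = 4 \cdot \frac{1}{4} \cdot P_w = P_w, \qquad \Pr[r_i = w_0 - 1] = 1 - P_w.$$

Neither value depends on $d$, only on the public width $w$.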

We point out that each digit $d_w[i]$ is not uniformly distributed in the class
$U_{w_0-1} \cup B$. The auxiliary variables $x[i]$ and $y[i]$ of the proposed
algorithm are randomly distributed in $U_{w_0}$ and $U_{w_0-1}$, respectively.
The resulting $d_w[i]$ is assigned $x[i]$ with probability $P_w$ and $y[i]$ with
probability $1 - P_w$. Thus some points in the class $U_{w_0-1} \cup B$ appear
with higher probability. However, the proposed scheme conceals such points,
because $B$ is randomly chosen and the points in $U_{w_0-1}$ are also randomly
chosen. That is, the attacker might learn the distribution, but he/she cannot
detect the correspondence between the points with higher probability and the
value of $d_w[i]$.

In contrast, if we simply choose predetermined numbers as in the fractional
window method, then the scheme is not secure against SPA. For example, consider
the case $w = 3 + 1/8$, $B = \{\pm 9\}$. If the length of the run of consecutive
zero bits is 3, then the conditional probabilities that the next non-zero
$d_w[i] = \pm 9$ are 1/4 each, while the probabilities that
$d_w[i] = \pm 1, \pm 3, \pm 5, \pm 7$ are 1/16 each. Thus, the probabilities are
not uniform. Since the attacker knows the predetermined numbers that belong to
$B$, he/she has an advantage in guessing $d_w[i]$. For example, the attack
proposed by Oswald [Osw02] reduces the cost of the exhaustive search when the
candidates for the secret key are not uniformly distributed.



4.4 Memory Consumption and Computation Cost 

We discuss the memory consumption and the efficiency of the proposed scheme.

The efficiency of ECC depends strongly on the representation of the base field,
the coordinate system, and the defining equation. The proposed scheme aims at a
secure encoding of the addition chain, and these parameters can be chosen
freely. Because we attach importance to the flexibility of cryptographic
schemes, we do not estimate the computation cost of individual optimizations.
Instead, we estimate the trade-off between memory consumption (the size of the
pre-computed table) and computation cost (the density of non-zero bits) for
the proposed scheme. We have the following theorem.

Theorem 2. The size of the pre-computed table is $(1 + w_1) 2^{w_0-2}$. The
density of the non-zero bits is asymptotically
$\frac{P_w}{w_0} + \frac{1 - P_w}{w_0 - 1}$.

Proof. In Step 2 we pre-compute the points for the set $B$, whose number is
exactly $w_1 2^{w_0-2}$. In addition to the set $B$, the proposed scheme
prepares the pre-computed points $P, 3P, \ldots, (2^{w_0-1} - 1)P$ for the base
point $P$; the number of these points is $2^{w_0-2}$. Thus the number of all
the pre-computed points is $(1 + w_1) 2^{w_0-2}$.

In Step 4.2 the length of the run of consecutive zero bits is $w_0 - 1$ (i.e.,
$\underbrace{0 \cdots 0}_{w_0 - 1} u[i]$) with probability $P_w$ or $w_0 - 2$
(i.e., $\underbrace{0 \cdots 0}_{w_0 - 2} u[i]$) with probability $1 - P_w$.
Therefore the density of non-zero bits is asymptotically
$\frac{P_w}{w_0} + \frac{1 - P_w}{w_0 - 1}$. □

In Table 1 we summarize these results. The table size includes the base point
P itself. If w is integral, i.e., $w = w_0$, then the table size and the density
of non-zero bits of the proposed scheme are the same as those of the scheme
proposed by Okeya-Takagi. The proposed scheme fills the gaps between the
discrete table sizes $2^{w-1}$ for $w = 2, 3, 4, \ldots$, so that every possible
number of pre-computed points can be used. Thus the designers of elliptic curve
cryptosystems can flexibly choose the table size suitable for the smart card.
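The entries of Table 1 can be recomputed with a few lines of Python. This sketch rests on two assumptions stated here rather than taken verbatim from the text: that $P_w = w_1$ (which follows from $P_w = \#B/(\#U_{w_0} - \#U_{w_0-1})$), and that the density is evaluated as $P_w/w_0 + (1 - P_w)/(w_0 - 1)$, the expression that reproduces the tabulated values. The function names are illustrative.

```python
import math
from fractions import Fraction


def table_size(w):
    """Number of pre-computed points, (1 + w1) * 2^(w0 - 2), for a width w >= 2."""
    w = Fraction(w)
    w0 = math.ceil(w)
    w1 = w - (w0 - 1)
    return (1 + w1) * Fraction(2) ** (w0 - 2)


def nonzero_density(w):
    """Asymptotic density P_w / w0 + (1 - P_w) / (w0 - 1), assuming P_w = w1."""
    w = Fraction(w)
    w0 = math.ceil(w)
    p_w = w - (w0 - 1)
    return p_w / w0 + (1 - p_w) / (w0 - 1)


if __name__ == "__main__":
    for width in (2, 2.5, 3, 3.25, 3.5, 3.75, 4, 4.125):
        print(width, table_size(width), float(nonzero_density(width)))
```

Exact rational arithmetic is used so that fractional widths such as 3.25 or 4.125 give integer table sizes without rounding.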



Table 1. Memory and Efficiency of the Proposed Scheme

Width              2     2.5    3      3.25    3.5     3.75    4      4.125
Table Size         2     3      4      5       6       7       8      9
Non-Zero Density   0.5   0.42   0.33   0.313   0.291   0.271   0.25   0.244





4.5 Other Security Properties 

We discuss other security properties of our proposed scheme, namely a possible 
attack using the DPA and a security comparison with the randomized window 
methods. 

The proposed scheme aims at resisting SPA, but we also discuss its security
against DPA. In Section 3, we mentioned that SPA-resistant schemes can be
easily converted into DPA-resistant ones using randomization tricks [Cor99,
JT01]. Thus, the proposed scheme can be converted into a DPA-resistant one. On
the other hand, window methods using a fixed secret scalar are vulnerable to
sophisticated DPA, e.g., the second order DPA [OS02b, OT03] and the address-bit
DPA [IIT02]. These DPA attacks can detect which pre-computed points are used in
the ECADD, and the associated bits of the secret scalar can be revealed. Since
countermeasures against such sophisticated DPA attacks were also proposed in
[OS02b, OT03, IIT02], the scheme combined with them is secure against the
sophisticated DPA.

The addition chain of the proposed scheme is generated by randomly choosing
between the two window lengths $2^{w_0}$ and $2^{w_0-1}$. There are several
window methods that intend to protect against DPA by randomizing the addition
chain [Wal02a, IYTT02, LS01, OA01], etc. The goal of these schemes is different
from ours, but we compare their security in the sense of SPA. These schemes
produce several AD sequences depending on the secret scalar d and random
numbers. However, the distribution of the AD sequences is not uniform, but
depends on the secret scalar. Indeed, some of them were broken [OS02a, Wal02b,
Wal03a, OS03, HCJ+03] because of such bias. On the other hand, the
randomization of the proposed scheme is independent of the secret scalar. There
are two different AD sequences for the proposed scheme, and the probability
with which each of them appears depends only on the width w, which is a public
parameter.

5 Conclusion 

We proposed an SPA-resistant scalar multiplication for elliptic curve
cryptosystems, which allows us to choose any size for the table of pre-computed
points while keeping the running time efficient.

Smart cards are expected to host highly functional applications. In addition,
in order to meet the aims of users, applications are often added to (or deleted
from) smart cards. The memory space left for cryptographic functions depends on
these applications. In other words, flexibility in memory consumption and
efficiency is demanded of the cryptographic schemes. Indeed, with our proposed
scheme, (1) the designer of a smart card can flexibly choose the table size
suitable for the individual situation, and (2) the private information in the
smart card is protected against side channel attacks.

Abstract. Public Key Cryptography Standards (PKCS) #11 has gained wide
acceptance within the cryptographic security device community and has become
the interface of choice for many applications. The high esteem in which
PKCS #11 is held is evidenced by the fact that it has been selected by a large
number of companies as the API for their own devices. In this paper we analyse
the security of the PKCS #11 standard as an interface (e.g. an application
programming interface (API)) for a security device. We show that PKCS #11 is
vulnerable to a number of known and new API attacks and exhibits a number of
design weaknesses that raise questions as to its suitability for this role.
Finally we present some design solutions.



1 An Introduction to PKCS #11 

The Public Key Cryptography Standards (PKCS) were developed by RSA Security
Inc. “in cooperation with representatives of industry, academia and government
to provide a standard to allow interoperability and compatibility between
vendor devices and implementations.”¹ A significant factor in the success of
these standards can be attributed to this co-operative approach. The standards
cover a variety of aspects of Public Key cryptography including PKCS #1: RSA
Encryption Standard, PKCS #11: Cryptographic Token Interface Standard [18] and
PKCS #8: Private-Key Information Syntax Standard. Many significant APIs and
protocols have been built upon PKCS #11 (e.g. SSL). Notable products with
PKCS #11 support include Mozilla (the open source browser upon which the
Netscape browser is based) and SSL hardware accelerators from companies such as
nCipher, IBM, Thales, Rainbow and AEP amongst others. Indeed, this research was
prompted by the question of the suitability of the PKCS #11 API as an interface
to a hardware security module (or crypto coprocessor).

The designers of PKCS #11 described the design goals as follows: to “provide
a standard interface between applications and (portable) cryptographic devices”
and at the same time to “allow resource sharing” (a many-to-many relationship
between applications and devices). It was not intended to be a general interface
to cryptographic operations or security services. Rather it could be used to
build such services, operations or suitable APIs.

¹ Unless indicated otherwise, all quotations and figures are reproduced with
permission from [18].



C.D. Walter et al. (Eds.): CHES 2003, LNCS 2779, pp. 411-425, 2003. 
© Springer-Verlag Berlin Heidelberg 2003







J. Clulow 



In PKCS #11 terminology, a token is a device that stores objects (e.g. Keys,
Data and Certificates) and performs cryptographic operations. This is a logical
rather than a physical characterization: one device may have several distinct
logical tokens (akin to the concept of distinct domains). To make use of a
token (or to communicate with it), one must first establish a session with the
token, which requires the user to ‘login’ and to be authenticated to the
device. Thereafter, the user may make use of the functionality provided by the
token by making calls through the interface or API. Objects are characterized
as either token objects or session objects. Token objects are non-volatile in
nature and exist (i.e., are stored) on a token. In addition, they are visible
to all applications connected to the token. In contrast, session objects are
volatile, existing only for the duration of the session between an application
and a token. They only have scope within that session (i.e., are only visible
to the application which created them).

Each object has a set of properties that describes the object and controls its
use. For example, every key possesses the Key Type property, which identifies
it as either a public, private or secret key. The standard recognises that the
secrecy of private and secret keys must be protected, and these keys possess
the properties sensitive, extractable, always sensitive and never extractable.
“Sensitive keys cannot be revealed in plaintext off the token, and
unextractable keys cannot be revealed off the token even when encrypted (though
they can still be used as keys).”

PKCS #11 describes two types of users: security officers (SO) and normal
users (users). The security officer is responsible for administering the users
and for performing such operations as initially setting and changing passwords.
Unlike normal users, security officers cannot perform cryptographic operations.
All users must ‘login’ (i.e., be authenticated to the token) before they can
access the objects or capabilities of a token. This is achieved through the use
of a personal identification number (PIN), which acts essentially as a
password. The standard allows for this mechanism to be augmented with or
replaced by alternative, custom mechanisms in any given implementation (e.g.
PIN entry via PINpad or the use of smart cards). This does not, however,
prevent access to other users’ token objects, although this could be made
another implementation feature.



The Security of PKCS #11 

The standard has the following stated security targets. 

1. “Access to private objects on the token, . . . , requires a PIN. Thus, possessing 
the cryptographic device that implements the token may not be sufficient to 
use it; the PIN may also be needed.” 

2. “Additional protection can be given to private keys and secret keys by mark- 
ing them as ’sensitive’ or ’unextractable’. Sensitive keys cannot be revealed 
in plaintext off the token, and unextractable keys cannot be revealed off the 
token even when encrypted (though they can still be used as keys).” 

Implied within these statements is the intention that by marking objects as
‘sensitive’ and ‘unextractable’, another user is prevented from recovering the
secret values thereof. It does not appear to be the intention to prevent one
user from using another user’s private objects.

The designers discuss several areas of concern including operating system 
security, the actions of rogue applications and the threat posed by Trojan linked 
libraries or device drivers that may subvert security, perhaps by stealing the 
password. Similar concerns related to the ’sniffing’ of communication lines to 
the cryptographic device exist (eavesdropping). This leads to several possible 
compromises such as PIN recovery, unauthorized access to a session (and the 
ability to insert, modify or delete commands) and the impersonation of a token or 
device. However, the standard claims that “. . . none of the attacks just described 
can compromise keys marked ‘sensitive,’ since a key that is sensitive will always 
remain sensitive. Similarly, a key that is ‘unextractable’ cannot be modified to 
be extractable.” Thus, in addition to examining the API for vulnerabilities, we 
are particularly interested in this claimed property. 

A cryptographic device that supports PKCS #11 faces the following potential
threats:

— a malicious security officer who abuses the authority of his position and his 
access to the device and user management functions, 

— a cheating or malicious user who exploits his authorized access to the token, 
and 

— a malicious third party who gains access to the token through some means. 

Essentially, these threats resolve into either gaining access to a session, or gaining 
access to a device during a session (e.g. by injecting messages into communica- 
tions lines) or having knowledge of a password. 

There exist some obvious, well-known attacks that are, generally speaking,
implementation dependent as opposed to weaknesses in the API itself. We briefly
describe them for completeness. The C_Login function is potentially vulnerable
to an exhaustive PIN (password) search since a user can try all possible
passwords. One typical defence is to keep a count of the number of failed login
attempts and ‘lock’ the card after a certain threshold of failures has been
reached. Ideally, the counter should be incremented prior to testing the PIN
and decremented thereafter only if the test is successful. Such a lockout can,
however, lead to a denial of service attack in which a malicious party tries to
prevent a valid user from being able to use the token: the attacker repeatedly
and intentionally masquerades as the user and attempts to login with an
incorrect PIN. An alternative approach is to make use of time delays during
start up and between login attempts.

CK_DEFINE_FUNCTION(CK_RV, C_Login)
(
    CK_SESSION_HANDLE hSession,
    CK_USER_TYPE userType,
    CK_CHAR_PTR pPin,
    CK_ULONG ulPinLen
);
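
The counter discipline described above (increment before the PIN test, decrement only on success) can be sketched as follows. This is a hypothetical model, not part of PKCS #11: the class `PinGuard`, its `max_failures` parameter and the in-memory counter are assumptions made for illustration, and a real token would keep the counter in non-volatile memory.

```python
import hmac


class PinGuard:
    """Hypothetical login guard that counts an attempt before checking the PIN."""

    def __init__(self, pin: bytes, max_failures: int = 3):
        self._pin = pin
        self._max = max_failures
        self.failures = 0  # would live in non-volatile memory on a real token

    def login(self, candidate: bytes) -> bool:
        if self.failures >= self._max:
            raise PermissionError("token locked")
        # Increment first, so that interrupting the check cannot skip the count.
        self.failures += 1
        if hmac.compare_digest(candidate, self._pin):
            # Decrement only after a successful comparison.
            self.failures -= 1
            return True
        return False
```

The same sketch also shows the denial of service side effect: after `max_failures` wrong guesses by anyone, even the legitimate PIN is refused.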

A malicious security officer could use the C_InitPIN function to change a
given user’s PIN to a known value, hence gaining access to the token. Since all
users have access to all objects on the token, another less detectable approach
would be to create a new user with a known PIN. This new user would be able to
gain access to the token objects. While the power inherently held by a security
officer in a given system is understood, PKCS #11 fails to specify directly the
use of dual control mechanisms, which would defeat a single malicious security
officer, although not a conspiracy of security officers.

CK_DEFINE_FUNCTION(CK_RV, C_InitPIN)
(
    CK_SESSION_HANDLE hSession,
    CK_CHAR_PTR pPin,
    CK_ULONG ulPinLen
);

Key Management Functions 

PKCS #11 provides a typical set of key management functionality including: 

— C_GenerateKey that generates a secret key, 

— C_GenerateKeyPair that generates a public/private key pair, 

— C_WrapKey that wraps (i.e., encrypts) a private or secret key, 

— C_UnwrapKey that unwraps (i.e. decrypts) a wrapped key, and 

— C_DeriveKey that derives a key from a base key. 

Let us consider the C_WrapKey function further. It has the following proto- 
type: 

CK_DEFINE_FUNCTION(CK_RV, C_WrapKey)
(
    CK_SESSION_HANDLE hSession,
    CK_MECHANISM_PTR pMechanism,
    CK_OBJECT_HANDLE hWrappingKey,
    CK_OBJECT_HANDLE hKey,
    CK_BYTE_PTR pWrappedKey,
    CK_ULONG_PTR pulWrappedKeyLen
);



hSession is the session’s handle; pMechanism points to the wrapping mech- 
anism; hWrappingKey is the handle of the wrapping key; hKey is the handle of 
the key to be wrapped; pWrappedKey points to the location that receives the 
wrapped key; and pulWrappedKeyLen points to the location that receives the 
length of the wrapped key. 

C_WrapKey can be used in the following situations: 

— To wrap any secret key with an RSA public key. 

— To wrap any secret key with any other secret key. 

— To wrap an RSA, Diffie-Hellman, or DSA private key with any secret key. 

2 Symmetric Key API Attacks

A wrapped key or external encrypted key is commonly referred to as an en- 
crypted key token (T). Keys are typically wrapped (or encrypted) under a key 
encrypting key (KEK) for exchange or under a master key (MK) for storage 
external to the device. Initially we shall consider the wrapping of a secret key 
with another secret key. The mechanism describes the method of the wrapping 
operation and follows a naming convention of the form CKM_<NAME>_<MODE>.
For example, CKM_DES_ECB, CKM_DES_CBC, CKM_DES_CBC_PAD, CKM_DES3_ECB, 
CKM_DES3_CBC and CKM_DES3_CBC_PAD are the mechanisms that make use of 
either single DES or triple DES. Other ciphers are possible including RC2, RC4, 
RC5, CAST, IDEA, etc. 

2.1 Key Conjuring 

Key conjuring is any technique that leads to the unauthorized generation of 
keys in the device. It is so named owing to the fact that the keys are ’conjured’ 
(magically created or appearing seemingly out of nowhere). Bond in [6] first 
identified key conjuring as a security risk. This is for two reasons. First, it defeats 
any access control that was placed on the official key generation function by 
providing an alternative and unauthorized mechanism to perform effectively the 
same operation. Secondly, a key conjuring mechanism can be exploited to build 
a large set of keys, which can then be attacked by a parallel search, as described 
in Section 2.6. 

Bond observed that crypto coprocessor designs that store keys outside the
tamper-proof device are vulnerable to unauthorized key generation. For
instance, a random 8 bytes submitted as an external encrypted DES key will be
decrypted and used as a key. For example, using random data ($R$), a user
creates a token $T_{random} = R$, which is then supplied to the C_UnwrapKey
function call to the device. The device decrypts $T_{random}$ as
$d_{MK}(T_{random})$, yielding a new key $k_{random} = d_{MK}(T_{random})$. If
parity checking is enforced, then there is a 1 in $2^8$ chance that this new
‘key’ will have the correct parity. By repeating this process on average $2^8$
times, an attacker can expect to conjure a new key into the system in this
manner. In fact this method is available in some older devices as a key
generation function. Instead of merely testing for parity, the function will
correctly set the parity in the process.

Key conjuring can be defeated through the associated use of a MAC or hash. 
This has the property of authenticating the clear value of the key as valid. 
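
The 1 in $2^8$ parity estimate can be checked directly. The sketch below assumes the usual DES convention that each of the 8 key bytes carries odd parity, with the least significant bit acting as the parity bit; the helper names are illustrative and not part of PKCS #11.

```python
def has_odd_parity(byte: int) -> bool:
    """True if the byte contains an odd number of one bits."""
    return bin(byte & 0xFF).count("1") % 2 == 1


def valid_des_parity(key: bytes) -> bool:
    """True if all 8 bytes of a single length DES key have odd parity."""
    return len(key) == 8 and all(has_odd_parity(b) for b in key)


def fix_parity(key: bytes) -> bytes:
    """Flip the low bit of each byte where needed so that parity becomes odd."""
    return bytes(b if has_odd_parity(b) else b ^ 1 for b in key)
```

Exactly half of all byte values have odd parity, so a uniformly random 8-byte value passes the check with probability $(1/2)^8 = 1/2^8$. A device that calls something like `fix_parity` instead of rejecting the key accepts every conjured value.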

2.2 Key Binding (Integrity) 

We observe that the choice of mode for the C_WrapKey is left to the caller (the
user). In addition, there is no enforced use of a MAC or other technique to ensure
data authenticity. There is also no restriction on the use of keys with repeated 
halves. As a result of the lack of cryptographic binding, one can attack each half 
of a key independently in the following way: 

1. Export the target double length key (under any key encrypting key and in
   any mode). We denote the double length key as the ordered pair
   $K = (K_1, K_2)$ and note that each half is encrypted independently to form
   the encrypted key token (T):

       $T = e_{KEK}((K_1, K_2))$
         $= (e_{KEK}(K_1), e_{KEK}(K_2))$
         $= (T_1, T_2)$.

2. Re-import the first half of the exported key as a single length key
   encrypted in ECB mode (using the same key encrypting key):
   $d_{KEK}(T_1) = d_{KEK}(e_{KEK}(K_1)) = K_1$.

3. Re-import the second half as a single length key encrypted in ECB mode
   (using the same key encrypting key):
   $d_{KEK}(T_2) = d_{KEK}(e_{KEK}(K_2)) = K_2$.

4. Perform a key search attack against each single length key ($K_1$, $K_2$)
   individually.

Algorithm 1: Typical Key Binding Attack
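
The block-wise independence that Algorithm 1 exploits can be illustrated with a toy cipher. The sketch below is not DES and not a PKCS #11 implementation: a small Feistel network built from SHA-256 stands in for a 64-bit block cipher, and `wrap_ecb` and `unwrap_ecb` are hypothetical names that only model ECB wrapping without any integrity check.

```python
import hashlib

BLOCK = 8  # block size in bytes, as for DES


def _round(key: bytes, i: int, half: bytes) -> bytes:
    # Toy round function: 4 bytes derived from the key, round index and half block.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:4]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))


def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right


def toy_decrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right


def _ecb(fn, key: bytes, data: bytes) -> bytes:
    # ECB: every 8-byte block is processed independently of its neighbours.
    return b"".join(fn(key, data[i:i + BLOCK]) for i in range(0, len(data), BLOCK))


def wrap_ecb(kek: bytes, key_material: bytes) -> bytes:
    return _ecb(toy_encrypt_block, kek, key_material)


def unwrap_ecb(kek: bytes, token: bytes) -> bytes:
    return _ecb(toy_decrypt_block, kek, token)
```

Wrapping a 16-byte key yields a token whose two 8-byte halves unwrap on their own to $K_1$ and $K_2$, so each half can be re-imported and searched separately. A MAC over the whole token would make the truncated token fail verification.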

The key binding issue for double (and triple) length DES keys is well known, 
having been documented in [6] and exploited by [7], [9] and [11]. Indeed, this 
flaw has prompted a warning from the ANSI X9 Financial Services Committee 
[3] and is the subject of several revised proposals [1] and [2]. 

The API should not allow an exported key to be modified (especially the ‘cut
and paste’ action on key components). Ideally, it should prevent the importation
of such a modified or ’Trojan’ key by employing some technique to verify that 
it is a genuine and authentic key. A typical solution is the use of a MAC on the 
exported key. 



2.3 Key Separation 

The secret key objects of PKCS #11 do allow for the specification of the use of 
the key for the operations of encrypting, decrypting, signing (MAC generation), 
verifying (MAC verification), key wrapping and key unwrapping. This is done 
through the use of the following attributes: 



Attribute     Value      Meaning
CKA_ENCRYPT   CK_BBOOL   TRUE if key supports encryption
CKA_DECRYPT   CK_BBOOL   TRUE if key supports decryption
CKA_SIGN      CK_BBOOL   TRUE if key supports signatures (i.e., authentication codes)
CKA_VERIFY    CK_BBOOL   TRUE if key supports verification (i.e., of authentication codes)
CKA_WRAP      CK_BBOOL   TRUE if key supports wrapping
CKA_UNWRAP    CK_BBOOL   TRUE if key supports unwrapping



Unfortunately, the API allows the specification of conflicting properties in 
that these attributes can be independently specified. This leads to a typical 
separation attack: 

1. Start with the key ($K$) having the ability to wrap keys (i.e., act as a key
   encrypting key) and decrypt data.

2. Export the target key ($K_{target}$) under the key encrypting key ($K$)
   using the C_WrapKey function, yielding the token $T = e_K(K_{target})$.

3. Decrypt the resultant token using the C_Decrypt function with $K$ (the key
   wrapping key) as a data decryption key. This returns
   $d_K(T) = d_K(e_K(K_{target})) = K_{target}$ (i.e., the clear value of the
   target key).

Algorithm 2: Typical Key Separation Attack
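
The attack can be replayed against a toy model of a token. Everything in the sketch below is hypothetical: `ToyToken`, `KeyObject` and the XOR-with-keystream `toy_cipher` are stand-ins chosen for brevity and do not reflect the real PKCS #11 calling conventions or a real cipher. The only property modelled is that the same key value is usable both for wrapping and for data decryption.

```python
import hashlib
from dataclasses import dataclass


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256 keystream; applying it twice restores the input (toy only)."""
    stream = hashlib.sha256(key).digest()
    if len(data) > len(stream):
        raise ValueError("toy cipher handles at most 32 bytes")
    return bytes(d ^ s for d, s in zip(data, stream))


@dataclass
class KeyObject:
    value: bytes
    wrap: bool = False         # models CKA_WRAP
    decrypt: bool = False      # models CKA_DECRYPT
    extractable: bool = True


class ToyToken:
    def __init__(self):
        self._keys = {}

    def add_key(self, handle: str, key: KeyObject) -> None:
        self._keys[handle] = key

    def wrap_key(self, wrapping_handle: str, target_handle: str) -> bytes:
        kek, target = self._keys[wrapping_handle], self._keys[target_handle]
        if not kek.wrap or not target.extractable:
            raise PermissionError("wrap not permitted")
        return toy_cipher(kek.value, target.value)

    def decrypt(self, handle: str, data: bytes) -> bytes:
        key = self._keys[handle]
        if not key.decrypt:
            raise PermissionError("decrypt not permitted")
        return toy_cipher(key.value, data)
```

With a key `K` created with both `wrap=True` and `decrypt=True`, calling `wrap_key("K", "target")` and then `decrypt("K", token)` returns the clear target key, although the token never exports key values in plaintext by itself.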

Since the values of the attributes may be modified using the
C_SetAttributeValue call or in the process of copying an object using the
C_CopyObject function, it is possible for an adversary to manipulate existing
keys. The PKCS #11 documentation does note that a particular implementation or
token may choose to “. . . permit modification of the attribute, or may not
permit modification of the attribute during the course of a C_CopyObject
call”.

The problem is exacerbated in the key export/import process since an ex- 
ported (or wrapped) key contains no such separation information bound to the 
token. As a result, any given exported key could be imported twice with different 
attributes. For example, the key could be imported as a key wrapping key the 
first time, and then as a data decrypting key the second time, thus facilitating 
the attack. 

Clearly, greater consideration must be paid to key separation issues in the 
API. Ideally, the choice of attribute combination must be restrictive in order to 
prevent such attacks. Furthermore, such information must be cryptographically 
bound to the wrapped key as in [1]. 

2.4 Weaker Key/Algorithm 

The PKCS #11 specification allows for the wrapping of a key by a second key 
of shorter length. Thus one need only attack the weaker key in order to recover 
the original key. 

1. Export the target double length DES key ($K_{target} = (K_1, K_2)$) under a
single length key (KEK) as

$$T = e_{KEK}(K_{target}) = e_{KEK}((K_1, K_2)) = (e_{KEK}(K_1), e_{KEK}(K_2)) = (T_1, T_2).$$

2. Export the single length key (KEK) under itself, yielding $T_{KEK} =
e_{KEK}(KEK)$.

3. Attack the single length key by performing an exhaustive search. 

4. Once the single length key has been recovered, one can trivially recover the 
original double length key. 

Algorithm 3: Example Weaker Key Attack 
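
A scaled-down sketch of Algorithm 3 follows. It assumes a hypothetical "short" key with only 16 unknown bits and a toy XOR cipher keyed through SHA-256 in place of DES, so the exhaustive search finishes instantly; the sizes and helper names are illustrative only.

```python
import hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Toy XOR cipher keyed by SHA-256(key); a stand-in for a real block cipher."""
    pad = hashlib.sha256(key).digest()
    return bytes(a ^ b for a, b in zip(data, pad))

toy_decrypt = toy_encrypt  # XOR is its own inverse

def make_short_key(value: int) -> bytes:
    """Hypothetical short key: 8 bytes, of which only 16 bits are unknown."""
    return b"\x00" * 6 + value.to_bytes(2, "big")

kek = make_short_key(0xBEEF)
k_target = bytes.fromhex("00112233445566778899aabbccddeeff")  # the longer key

t_target = toy_encrypt(kek, k_target)  # step 1: target key wrapped under KEK
t_kek = toy_encrypt(kek, kek)          # step 2: KEK wrapped under itself

def search_short_key(token: bytes):
    """Step 3: try every short key g and test whether e_g(g) equals the token."""
    for value in range(2 ** 16):
        g = make_short_key(value)
        if toy_encrypt(g, g) == token:
            return g
    return None

found = search_short_key(t_kek)
recovered = toy_decrypt(found, t_target)  # step 4: unwrap the longer key
```

The self-wrapped token gives the attacker a test for each guess, so the cost of recovering the longer key is bounded by the key space of the shorter one.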







PKCS #11 supports keys with particularly small key sizes (e.g. RC2), mak- 
ing the search feasible. It should not be possible to downgrade the security, by 
protecting a longer key with a shorter key. Similarly, it should not be possible 
to use a weaker algorithm when exporting keys. 

We note that the previous attacks do not contradict the security claim that 
’sensitive’ and ’unextractable’ keys cannot be compromised, since they require 
that the target key be exportable. What about other attacks? We focus our 
attention on the C_DeriveKey function, which has the following prototype: 

    CK_DEFINE_FUNCTION(CK_RV, C_DeriveKey)
    (
        CK_SESSION_HANDLE hSession,
        CK_MECHANISM_PTR pMechanism,
        CK_OBJECT_HANDLE hBaseKey,
        CK_ATTRIBUTE_PTR pTemplate,
        CK_ULONG ulAttributeCount,
        CK_OBJECT_HANDLE_PTR phKey
    );



The C_DeriveKey function supports the following mechanisms:

— CKM_CONCATENATE_BASE_AND_KEY, which derives a secret key from the con- 
catenation of two existing secret keys, 

— CKM_CONCATENATE_BASE_AND_DATA, which derives a secret key by concate- 
nating data onto the end of a specified secret key, 

— CKM_CONCATENATE_DATA_AND_BASE, which derives a secret key by prepending 
data to the start of a specified secret key, 

— CKM_XOR_BASE_AND_DATA, which is a mechanism that provides the capability 
for deriving a secret key by performing the exclusive-oring of a key pointed 
to by a base key handle and some data, and finally 

— CKM_EXTRACT_KEY_FROM_KEY that provides the capability of creating one 
secret key from the bits of another secret key. 



2.5 Reduced Key Space 

Using the CKM_EXTRACT_KEY_FROM_KEY mechanism, one can extract a subset of 
the bits from a given key to create a shorter key. This can be used to reduce the
key space required to be searched. For example, one could extract 40 bits from a 
DES key to create a 40-bit RC2 key, which can then be searched by exhaustive 
means. The actual key space may be smaller owing to the existence of parity bits 
in the DES key. The remaining 24 bits (less 3 parity bits) of the original DES key 
can then be searched for independently. This potentially dangerous mechanism 
relies on the ’unextractable’ flag in the key token to prevent misuse. It does not 
prevent an attacker from using this method to obtain a known key in the system 
or from compromising extractable keys. 
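
The arithmetic behind this split can be written out as a small calculation. It assumes that the 40 extracted bits are five whole key bytes, each carrying one parity bit; the text does not fix the alignment, so the parity counts are an assumption.

```python
des_key_bits = 64      # DES key as stored, including 8 parity bits
extracted_bits = 40    # bits copied into the 40-bit RC2 key

parity_in_extracted = extracted_bits // 8        # 5, assuming byte alignment
remaining_bits = des_key_bits - extracted_bits   # 24
parity_in_remaining = remaining_bits // 8        # 3

# Effective search sizes for the two independent searches.
search_first = 2 ** (extracted_bits - parity_in_extracted)    # 2**35
search_second = 2 ** (remaining_bits - parity_in_remaining)   # 2**21
split_cost = search_first + search_second

full_cost = 2 ** 56    # searching all 56 effective key bits at once
```

The two partial searches add rather than multiply, which is why the split is so much cheaper than a search over the whole key.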
2.6 Parallel Search

The CKM_XOR_BASE_AND_DATA provides an easy method with which to exclusive- 
or known patterns onto a key. This can be used to reduce the key space required 
to be searched by generating a large number of (known) related keys as per the 
method suggested by [12] and [19] and exploited by [9]. 

1. Generate a set of $2^{16}$ known related keys of the original target key,
$\{K_i \mid K_i = K_{target} \oplus \Delta_i,\ i = 1, \dots, 2^{16}\}$, where
$\Delta_i \neq \Delta_j$ for $i, j \le 2^{16}$, $i \neq j$, and each $\Delta_i$ is a
non-zero known value.

2. Using each key, encrypt a known pattern ($P$) and store the result in a
searchable database $\{C_i \mid C_i = e_{K_i}(P),\ i = 1, \dots, 2^{16}\}$.

3. Search for a key by iteratively performing trial encryptions of the known
pattern ($P$) and comparing the result to the entries in the database.

4. After $2^{40}$ trial encryptions on average, we expect to find a match (i.e., we
find a key $K_i$ which produces an encrypted output in the database).

5. Recover the original target key as $K_{target} = K_i \oplus \Delta_i$.

Algorithm 4 : Parallel Key Search Using Related Keys 

Since we know how this key is related to all the others, we know all the $2^{16}$
keys including the original one. This clearly demonstrates the danger of being 
able to modify a key as well as the true threat posed by the seemingly benign key 
conjuring vulnerability. Knowledge of the modification makes the attack easier 
but is not a requisite for the attack. 
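
The parallel search can be sketched at toy scale. The sizes (a 20-bit key and 255 related keys), the SHA-256 based `toy_encrypt` and the `ToyToken` class are all assumptions made so that the example runs quickly; they do not model DES or a real token.

```python
import hashlib

KEY_BITS = 20                  # toy key size
SET_BITS = 8                   # 2**SET_BITS - 1 non-zero offsets in this sketch
SHIFT = KEY_BITS - SET_BITS    # offsets are placed in the top bits of the key

def toy_encrypt(key: int, pattern: bytes) -> bytes:
    """Toy keyed function standing in for encryption of a known pattern."""
    return hashlib.sha256(key.to_bytes(4, "big") + pattern).digest()[:8]

class ToyToken:
    """Hypothetical token that XORs known data onto its secret key, then encrypts."""

    def __init__(self, secret: int):
        self._secret = secret

    def encrypt_under_related_key(self, delta: int, pattern: bytes) -> bytes:
        return toy_encrypt(self._secret ^ delta, pattern)

def parallel_search(token: ToyToken, pattern: bytes):
    # Steps 1-2: ciphertexts of the pattern under K_target XOR delta_i.
    deltas = [i << SHIFT for i in range(1, 2 ** SET_BITS)]
    table = {token.encrypt_under_related_key(d, pattern): d for d in deltas}
    # Steps 3-4: trial encryptions until any table entry is hit.
    for trials, guess in enumerate(range(2 ** KEY_BITS), start=1):
        delta = table.get(toy_encrypt(guess, pattern))
        if delta is not None:
            # Step 5: undo the known offset to obtain the target key.
            return guess ^ delta, trials
    return None, 2 ** KEY_BITS

token = ToyToken(0xABCDE)
key, trials = parallel_search(token, b"known pattern")
```

Because the search stops at the first hit on any related key, the number of trials shrinks roughly by the size of the related-key set.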

2.7 Related Key Attack 

Using the CKM_XOR_BASE_AND_DATA mechanism, one can create a set of related
keys with which to perform a related key attack [5], [14], [15]. This can be
used to reduce 3-key 3DES to only slightly stronger than single DES (reducing
the key space search to $2^{56}$ operations to isolate a key component). The attack
is elegantly simple and easily explained. Using the related key pair
$K_1 = \langle k_1, k_2, k_3 \rangle$, $K_2 = \langle k_1 \oplus \Delta, k_2, k_3 \rangle$,
encrypt a plaintext $P$ with $K_1$, and then decrypt the ciphertext with $K_2$, yielding $P'$.
Then $C = e_{K_1}(P)$, $P' = d_{K_2}(C)$, and hence $P' = d_{K_2}(e_{K_1}(P))$.
Using 3DES in EDE mode (the mode itself doesn’t matter):

$$P' = d_{k_1 \oplus \Delta}(e_{k_2}(d_{k_3}(e_{k_3}(d_{k_2}(e_{k_1}(P)))))) = d_{k_1 \oplus \Delta}(e_{k_1}(P))$$

Thus, $k_1$ has been successfully isolated and can be recovered independently
of $k_2$ and $k_3$, typically by exhaustive key search. The work required on average
to effect the search is $2^{56}$ single DES operations. Hence the cipher in triple
mode has been reduced to only slightly greater than the strength of the cipher
in single mode. This attack can be further enhanced by combining it with
parallel key search techniques. For example, using a set of related key pairs
$\{(\langle k_1 \oplus \Delta_i, k_2, k_3 \rangle, \langle k_1 \oplus \Delta_i \oplus \Delta, k_2, k_3 \rangle) \mid i = 1, \dots, 2^{16}\}$
would reduce the average search effort to $2^{40}$ DES operations.

The 2-key 3DES version of the attack described in [11] is not practically
feasible. However, there exists a more efficient attack that first ’converts’
a double length DES key into a triple length DES key using the
CKM_CONCATENATE_BASE_AND_DATA mechanism. Following this, the three-key
related key attack can be used as is.
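
The isolation of the first key component can be checked with a toy cipher. The sketch below assumes a hypothetical 4-round Feistel cipher on 32-bit blocks with 12-bit keys (round function derived from SHA-256) in place of DES; only the EDE structure and the XOR relation between the two keys are taken from the text.

```python
import hashlib

KEY_BITS = 12  # toy component key size; DES would be 56

def _f(key: int, rnd: int, half: int) -> int:
    """Toy round function: 16 bits derived from key, round number and half block."""
    data = key.to_bytes(2, "big") + bytes([rnd]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

def enc(key: int, block: int) -> int:
    l, r = block >> 16, block & 0xFFFF
    for rnd in range(4):
        l, r = r, l ^ _f(key, rnd, r)
    return (l << 16) | r

def dec(key: int, block: int) -> int:
    l, r = block >> 16, block & 0xFFFF
    for rnd in reversed(range(4)):
        l, r = r ^ _f(key, rnd, l), l
    return (l << 16) | r

def ede_encrypt(keys, p):
    k1, k2, k3 = keys
    return enc(k3, dec(k2, enc(k1, p)))

def ede_decrypt(keys, c):
    k1, k2, k3 = keys
    return dec(k1, enc(k2, dec(k3, c)))

def observe(keys, delta, p):
    """What the attacker obtains: encrypt under K1, decrypt under the related K2."""
    k1, k2, k3 = keys
    return ede_decrypt((k1 ^ delta, k2, k3), ede_encrypt(keys, p))

def recover_k1(delta, pairs):
    """Search k1 alone: P' = d_{k1 XOR delta}(e_{k1}(P)) does not involve k2 or k3."""
    for k in range(2 ** KEY_BITS):
        if all(dec(k ^ delta, enc(k, p)) == p_prime for p, p_prime in pairs):
            return k
    return None

secret = (0x5A3, 0x0F1, 0xC7E)
delta = 0x001
pairs = [(p, observe(secret, delta, p)) for p in (0x01234567, 0x89ABCDEF)]
k1 = recover_k1(delta, pairs)
```

Two plaintext pairs are used only to make accidental matches in the tiny toy key space negligible.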

Analysis and Implications 

We return to the security claim made by the designers. Both the parallel search 
attack and the related key attack contradict the claims of the API designers. 
This has several implications for individual users who are reliant on the security 
of a PKCS #11 token. Any user with read and write access to the token has
the ability to recover all token key objects. In addition, an adversary with the 
ability to gain access to a session (perhaps by injecting raw messages into the 
physical communications lines) likewise has the ability to recover keys from the 
token. To thwart the attack, one must prevent all unauthorized access to token 
objects. This intensifies the security concerns already listed by the designers and 
previously referred to. 

We now consider a means to expand the scope of the attack to include sessions 
with read only access to token objects. The C_CopyObject provides a method 
to copy a read only token object and to produce as output a session object. 
However, since all session objects have read/write access to that session, the 
attacker successfully obtains a duplicate of the key object with write access. 
He can thus attack the session object using the methods previously described, 
despite only having read access to the original target object. Therefore, it is 
advisable to reconsider the functionality of the C_CopyObject call particularly 
with respect to the preservation of properties such as write access. 

Finally, it is worth noting the work done in [9] as it directly reflects on the 
feasibility and speed of performing these attacks in practice. Bond and Clayton 
devised a parallel exhaustive key search machine using an ’off the shelf’ FPGA 
evaluation card costing approximately $1000, which was capable of performing 
a 2^® search in 22 hours. 

3 Public Key API Attacks 

We now extend our focus to consider attacks involving the use of (or against) 
Public Key API functionality. We start by revisiting the C_WrapKey function 
and consider first the wrapping of private RSA keys by symmetric keys. 

Wrapping/Unwrapping of Private Keys Using Symmetric Keys

In PKCS #11, a private key can only be exported (and imported) if it contains
not only the private exponent and modulus, but also the public exponent and
CRT info. This information is BER-encoded according to PKCS #1’s
RSAPrivateKey ASN.1 type. The resulting string of bytes is encrypted with a secret key
in CBC mode and with PKCS padding.



Attribute              Data Type    Meaning
CKA_MODULUS            Big integer  Modulus n
CKA_PUBLIC_EXPONENT    Big integer  Public exponent e
CKA_PRIVATE_EXPONENT   Big integer  Private exponent d
CKA_PRIME_1            Big integer  Prime p
CKA_PRIME_2            Big integer  Prime q
CKA_EXPONENT_1         Big integer  Private exponent d modulo p - 1
CKA_EXPONENT_2         Big integer  Private exponent d modulo q - 1
CKA_COEFFICIENT        Big integer  CRT coefficient q^-1 mod p



The CBC-encrypted ciphertext is decrypted, and the PKCS padding is re- 
moved. The data thereby obtained are parsed as a PrivateKeyInfo type, and the 
wrapped key is produced. An error will result if the original wrapped key does 
not decrypt properly, or if the decrypted unpadded data does not parse properly, 
or its type does not match the key type specified in the template for the new 
key. The unwrapping mechanism contributes only those attributes specified in 
the PrivateKeyInfo type to the newly-unwrapped key; other attributes must be 
specified in the template, or will take their default values. 



3.1 Weaker Key/Algorithm 

Following this description, we are immediately concerned with the choice of
symmetric key algorithm (and key length) used to protect the RSA private key,
which leads to attacks equivalent to those described in Section 2.4.

3.2 Private Key Modification 

Consider the effect of replacing one block of the ciphertext (i.e., the wrapped 
key) with a different value. When the key is unwrapped, this will cause the 
corresponding block of plaintext as well as the following block to have different 
values. The rest of the key remains intact. The length of the BER encoded big 
number data types depends upon the size of the big numbers (typically 512, 
1024 or 2048 bit numbers). In any event, they consist of at least a number of 
blocks. Thus an attacker can modify one of the big numbers independently of 
the other data in the wrapped private key (including the padding at the end). 
If the various key components (e.g. $n$, $p$, $q$, $e$, $d$, $d \bmod (p-1)$, $d \bmod (q-1)$ and
$q^{-1} \bmod p$) are not explicitly tested for consistency, the attacker gains access
to a modified ’Trojan’ key in the system. This can be used to effect the Fault 
Analysis attacks of [8], [4] and [13]. A similar attack against PGP private keys 
is described in [16] and, more generally, against public key APIs in [17] and [10]. 

A possible solution is that encrypted private keys have a strong cryptographic
method to ensure integrity of the key (e.g. MAC, hash or signature). In addition,
the integrity of the key must be confirmed using simple arithmetic checks (for
example, is $d_p = d \bmod (p-1)$ and $n = p \cdot q$).
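
The arithmetic checks mentioned above might look as follows. This is a minimal sketch: the function name and argument order are hypothetical, primality of p and q is not tested, and it complements rather than replaces a cryptographic integrity check on the wrapped key.

```python
from math import gcd

def rsa_key_is_consistent(n, e, d, p, q, dp, dq, qinv) -> bool:
    """Check the relations that the components of an RSA private key must satisfy."""
    if p <= 1 or q <= 1 or p * q != n:
        return False
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    if (e * d) % lam != 1:
        return False
    if dp != d % (p - 1) or dq != d % (q - 1):
        return False
    return (q * qinv) % p == 1

# Small textbook key: p = 61, q = 53, e = 17, d = 2753.
p, q, e, d = 61, 53, 17, 2753
good = rsa_key_is_consistent(p * q, e, d, p, q, d % (p - 1), d % (q - 1), pow(q, -1, p))
# A key whose CRT exponent was altered, as in the modification attack.
bad = rsa_key_is_consistent(p * q, e, d, p, q, d % (p - 1) + 1, d % (q - 1), pow(q, -1, p))
```

`pow(q, -1, p)` requires Python 3.8 or later.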

Wrapping/Unwrapping of Symmetric Keys Using Public Key Techniques

PKCS #11 supports two mechanisms for wrapping symmetric keys using Public 
Key techniques, namely: 

- CKM_RSA_PKCS (PKCS #1 RSA), and 

- CKM_RSA_X_509 (X.509 Raw RSA). 

The CKM_RSA_X_509 mechanism performs no padding or manipulation of data 
prior to encryption. It merely “...encrypts a byte string by converting it to 
an integer, most-significant byte first, applying ‘raw’ RSA exponentiation, and 
converting the result to a byte string, most significant byte first.” The encrypted 
token is $T = k^e \bmod n$, where e is the public exponent, k the key being exported and
n the modulus. This simple method results in exported keys being vulnerable 
when encrypted under small public exponents. 



3.3 Small Public Exponent with No Padding 

The clear key is right justified in the field provided, and the field padded to the
left with zeroes up to the size of the RSA encryption block (e.g. a 128-bit key
$k = k_1 k_2 \dots k_{128}$ is prepended with zero bits to give
$0_1 0_2 \dots 0_{l-128} k_1 k_2 \dots k_{128}$, where
$l$ is the length of the modulus). The resultant field is encrypted yielding $T = k^e
\bmod n$. If $k^e < n$ (i.e., $e < \log n / \log k$), no modular reduction takes place
and $T = k^e$ over the integers. Thus $k$ can be recovered as $k = T^{1/e}$.

Due to the speed advantages of having a small exponent with low Hamming 
weight, it is common for public keys to have exponents of 3 and $2^{16} + 1$. It
is not uncommon to be able to specify this as an option in many APIs when 
generating a public key. It is thus possible that a suitable public key will exist 
in the system. In any event, the public keys in PKCS #11 are clear tokens and 
thus one can easily ’conjure’ or create a public key with an exponent of 3. This 
weakness exists in a number of APIs [10]. 
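
The recovery by root extraction can be illustrated with plain integer arithmetic. In this sketch the modulus is simply the product of two Mersenne primes chosen for size; it is an assumption for illustration, and only the public operation is modelled (no private key is involved, and 3 need not be a valid exponent for this modulus).

```python
def integer_root(value: int, e: int) -> int:
    """Largest integer r with r**e <= value, found by binary search."""
    lo, hi = 0, 1 << (value.bit_length() // e + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** e <= value:
            lo = mid
        else:
            hi = mid - 1
    return lo

n = (2 ** 521 - 1) * (2 ** 607 - 1)          # stand-in modulus of about 1128 bits
e = 3
k = int.from_bytes(bytes(range(16)), "big")  # 16-byte key 00 01 ... 0f, zero padded on the left

t = pow(k, e, n)              # raw RSA wrapping with no padding
recovered = integer_root(t, e)
```

Since k^3 is far smaller than n, the modular reduction never happens and the integer cube root of the token is the key itself.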

3.4 Trojan Public Key 

As previously mentioned, the public keys in the PKCS #11 API are clear tokens 
with no additional authentication checks. Thus it is possible to use any clear 
public key as input to the C_WrapKey function. This allows an attacker to use a 
’Trojan’ public key for which he knows the corresponding private key (typically 
the attacker will probably generate the key pair himself). He requests that the PKCS
#11 token export the target key k under his supplied public key, obtaining the
response $T = k^e \bmod n$. Since the attacker knows the corresponding private
exponent d, he can easily recover the key as $T^d \bmod n = (k^e)^d \bmod n = k$. This simple
method can be used to recover all exportable keys regardless of whether they
are symmetric or private keys. It is thus clear that a public key needs to be 
authenticated before use to verify that it indeed has the authority to export a 
given key. 
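
A short sketch of the Trojan public key attack follows. `ToyToken` and its `wrap_key` method are hypothetical and model only raw RSA wrapping under an unauthenticated public key; the attacker's key pair is built from two Mersenne primes purely for convenience.

```python
# Attacker's own key pair (illustrative parameters, not a recommended key).
p, q = 2 ** 521 - 1, 2 ** 607 - 1
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent, Python 3.8+

class ToyToken:
    """Hypothetical token holding an exportable secret key."""

    def __init__(self, secret_key: int):
        self._secret = secret_key

    def wrap_key(self, pub_e: int, pub_n: int) -> int:
        # Accepts any clear public key without checking where it came from.
        return pow(self._secret, pub_e, pub_n)

token = ToyToken(int.from_bytes(b"0123456789abcdef", "big"))
t = token.wrap_key(e, n)      # T = k^e mod n under the attacker's public key
recovered = pow(t, d, n)      # T^d mod n = k
```

The token has no way to tell the attacker's public key from a legitimate one, which is why the text calls for authenticating public keys before they are used for export.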

3.5 Trojan Wrapped Key 

Similarly to the unauthenticated use of public keys, there is no method to verify 
that a wrapped key token is indeed authentic. Thus given a PKCS #11 device 
containing a private key (< d, n >), and knowledge of the value of the public 
key (< e, n >), the attacker proceeds as follows. He chooses an arbitrary key
k, which he then ’wraps’ under the known public key, obtaining $T = k^e \bmod n$.
He then calls the C_UnwrapKey function, supplying this ’Trojan’ wrapped key
T and referencing the handle of the private key inside the device. The PKCS
#11 token calculates $T^d \bmod n = (k^e)^d \bmod n = k$ and imports the known k as a new
key into the system. The attacker can then use k to export other keys from the 
device, which he can then decrypt and recover. Thus there exists a requirement 
to provide a means to verify the authenticity and origin of the wrapped key. 

3.6 Key Separation 

A symmetric key wrapped by a public key contains no separation information 
and can be exploited as described previously in Section 2.3. 

4 Solutions 

Some of these security issues can be easily addressed in the implementation of 
a PKCS #11 API. The more concerning issues unfortunately require a design 
change to the PKCS #11 standard. With the latter come the dual concerns 
of backwards compatibility and interoperability with other systems. A lack of 
backwards compatibility may be the price for a previously flawed design and a 
commitment to security. 

The Key Conjuring and Key Binding attacks are perhaps best addressed 
through a change in the external key token format, particularly for wrapped 
keys. There exist proposals such as [1] and [2] and one can expect a decision 
and guidance from such influential bodies as ANSI Financial Services Commit- 
tee, which will largely address the interoperability issues. Key Separation can 
be partially addressed by a given implementation that does not permit the con- 
flicting use of key attributes (e.g. CKA_WRAP and CKA_DECRYPT). However, the 
fact that the wrapped key contains no separation information is a fundamental 
design flaw and, like the Key Conjuring and Key Binding attacks, must be
addressed through a new external key token format. The Weaker Key/Algorithm
attack can be prevented by a given implementation by understanding and obey- 
ing the principle that a key should not be protected by a weaker key or algorithm. 
The ’unextractable’ and ’never extractable’ flags do offer protection against the
Reduced Key Space attack. Regardless, the author is not convinced that the
CKM_EXTRACT_KEY_FROM_KEY mechanism deserves consideration in the API. Sim- 
ilarly, the CKM_XOR_BASE_AND_DATA mechanism creates the opportunity for both 
the Parallel Search and Related Key attacks. Again one may question the need 
for such a function, particularly in its present form. 

Prevention of the Private Key Modification attack requires either the use of a 
consistency check to confirm the integrity of the key components, which could be 
implementation specific, or else a revision of the encrypted RSA key token that 
ensures integrity through some cryptographic means, such as an encrypted hash 
or MAC over the token. The Small Public Exponent with No Padding attack 
highlights the dangers of providing raw RSA functionality. The most sensible 
solution is to enforce the use of a recognised padding scheme. The only concern 
here would be backwards compatibility. Interoperability should not be an issue 
since any device that uses this method to export a key is obviously vulnerable 
to the attack. The Trojan Public Key and Trojan Wrapped Key attacks exploit 
a lack of authentication of public keys used for export and wrapped keys being 
imported. This requires a significant change to the standard to achieve these 
goals. 

5 Conclusions 

This paper has shown the susceptibility of PKCS #11 used as an API to a 
number of attacks. The attacks are efficient, computationally trivial and easy to 
implement. Some possible solutions are presented to defend against the attacks. 
Abstract. In this paper we present a practically feasible attack on RSA-based
sessions in SSL/TLS protocols. We show that incorporating a version number 
check over PKCS#1 plaintext used in the SSL/TLS creates a side channel that 
allows an attacker to invert the RSA encryption. The attacker can then either 
recover the premaster-secret or sign a message on behalf of the server. Practical 
tests showed that two thirds of randomly chosen Internet SSL/TLS servers were 
vulnerable. The attack is an extension of Bleichenbacher’s attack on PKCS#1 
(v. 1.5). We introduce the concept of a bad-version oracle (BVO) that covers 
the side channel leakage, and present several methods that speed up the original 
algorithm. Our attack was successfully tested in practice and the results of 
complexity measurements are presented in the paper. 



1 Introduction 

In contemporary cryptography, it is widely agreed that one of the most important 
issues of all asymmetric schemes is the way in which the scheme encodes the data to 
be processed. In the case of RSA [14], the most widely used encoding methods are 
described in PKCS#1 [9]. This standard also underlies RSA-based sessions in the 
family of SSL/TLS protocols. These protocols became de facto the standard platform 
for secure communication in the Internet environment. In this paper we assume a 
certain familiarity with their architecture (c.f. §5). Since its complete description is far 
beyond the scope of this article, we refer interested readers to the excellent book [10] 
for further details. In 1998 Bleichenbacher showed that the concrete encoding method 
called EME-PKCS1-v1_5, which is also employed in the SSL/TLS protocols, is
highly vulnerable to chosen ciphertext attacks [1]. The attack assumes that 
information about the course of the decoding process is leaking to an attacker. We 
refer to such attacks as side channel attacks, since they rely on side information that 
unintentionally leaks out from a cryptographic module during its common activity. 

Bleichenbacher showed that it is highly probable that side information exists 
allowing the attacker to break the particular realization of the RSA scheme in many 
systems based on EME-PKCS1-v1_5. He has also shown how to use such information
to decrypt any captured ciphertext or to sign any arbitrary message by using a 
common interaction with the attacked cryptographic module. As a countermeasure to 
his attack it was recommended to either use the EME-OAEP method (also defined in 
PKCS#1) or to steer attackers away from knowing details about the course of the 
decoding process. In the case of the SSL/TLS protocols it seemed to be possible to 
incorporate the second type of countermeasures. The story of the attack ended here by
incorporating appropriate warnings in appropriate standards [9], [10], [12], and [15]. 
Security architects were especially instructed not to allow an attacker to know 
whether the plaintext P being decoded has the prescribed mandatory structure marks 
or not. 

Besides being warned to carry out the above-mentioned countermeasure, architects 
were also instructed to carefully verify all possible marks of P that are specific for the 
SSL/TLS protocols. In particular, they were told to check the correctness of a version 
number (c.f. §5.2 and [12]), which is stored in the two left-most bytes of the 
premaster-secret. Unfortunately, it has not been properly specified how such a test 
may be combined with the countermeasure mentioned above and what to do if the 
"version number test" fails. Designers may be very tempted to simply issue an error 
message. In reality, however, such a message opened up a Pandora’s box bringing a 
new variant of side channel attack. In this paper we present this attack. It turns out 
that the version number, which was initially believed to rule out the original attack 
[1], even allows a relatively optimized variant of the attack if the version number 
check is badly implemented. Our practical tests showed that among hundreds of 
SSL/TLS servers randomly chosen from the Internet, two thirds of them were 
vulnerable to our attack (for details see §4.3). 

We note that the TLS protocol may be historically viewed as an SSL bearing the 
version number 3.1 [12], while the SSL with the version number 3.0 is often referred 
to as a "plain" SSL. There are some minor changes between SSL and TLS, but these 
changes are unimportant for the purpose of this paper, since we rely on the general 
properties, which are common to both SSL v. 3.0 and TLS. Therefore, we will talk 
about them as about the SSL/TLS protocols. We note that SSL protocols with version 
numbers less than 3.0 will not be considered here, since they have already been 
proven to have several serious weaknesses [10], [16]. 

The rest of the paper is organized as follows: in §2 we introduce a bad-version 
oracle (BVO), which is a construction that mathematically encapsulates side 
information leaking from the decoding process. The BVO is then used for mounting 
our attack in §3. The attack is based on an extended variant of Bleichenbacher’s
algorithm from [1]. The complexity of the attack together with the statistics of the 
vulnerable servers found on the Internet are given in §4. We note that due to page 
constraints, paragraphs 5 (Technical details) and 6 (Countermeasures) are fully 
elaborated in the extended version of the paper [19]. The conclusions are made in §7. 
In the appendix we recall a slightly generalized version of the original 
Bleichenbacher’s algorithm [1]. 

Proposition 1 (Connection and session). Unless stated otherwise, the term 
connection means the communication carried out between a client and a server. It 
lasts from when the client opened up a networked pipe with the server, until the pipe 
is closed. The term session is used to refer to a particular part of this connection 
which is protected under the same value of symmetrical encryption keys. 

Proposition 2 (RSA-based session). We say that the session is RSA-based if it uses 
the RSA scheme to establish its symmetrical keys. 
2 Bad-Version Oracle

We start by recalling the definition of PKCS-conforming plaintext [1]. Unless stated
otherwise, the term plaintext means an RSA plaintext. Furthermore, we denote RSA
instance parameters as (N, e, d), where N is a public modulus, e is a public exponent,
and d is a private exponent, such that for all x, $x \in \langle 0, N-1 \rangle$, it holds that
$x = (x^e \bmod N)^d \bmod N$. We denote as k the length of the modulus N in bytes, i.e.
$k = \lceil (\log_2 N)/8 \rceil$, and the boundary B as $B = 256^{k-2}$.

Definition 1 (PKCS-conforming plaintext). Let us denote the plaintext as P,
$P = \sum_{i=1}^{k} P_i \cdot 256^{k-i}$, $0 \le P_i \le 255$, where $P_1$ is the most
significant byte of the plaintext. We say that P is PKCS-conforming if the following
conditions hold:

i) $P_1 = 0$

ii) $P_2 = 2$

iii) $P_j \neq 0$ for all $j \in \langle 3, 10 \rangle$

iv) $\exists j$, $j \in \langle 11, k \rangle$, $P_j = 0$; the string $P_{j+1} \| \dots \| P_k$ is then
called a message M or a data payload

The definition describes the set of all valid plaintexts for the given modulus of the 
length k bytes. In the case of SSL/TLS protocols, however, only the subset of this set 
is allowed, since these protocols introduce several extensions to the basic PKCS#1 (v. 
1.5) format. Therefore, we define the term S-PKCS-conforming plaintext as follows. 

Definition 2 (S-PKCS-conforming plaintext). We say that P is S-PKCS-conforming 
if it is PKCS-conforming and the following conditions hold: 

i) $P_j \neq 0$ for all $j \in \langle 3, k-49 \rangle$

ii) $P_{k-48} = 0$

The main restriction introduced here is the constant number of data bytes (which is
equal to 48). The number of padding bytes equals k - 51. Furthermore, SSL/TLS
protocols introduce a special interpretation for the first two data bytes $P_{k-47}$ and
$P_{k-46}$, which are respectively regarded as major and minor version numbers. This
extension was introduced to thwart so-called version rollback attacks. The data
payload, which is the concatenation $P_{k-47} \| P_{k-46} \| P_{k-45} \| \dots \| P_k$, is called a
premaster-secret here. It is the only secret used in the key derivation process that
produces the session keys used by the client and the server in the given session. An
attacker, who is able to discover the premaster-secret, can decrypt the whole
communication between the client and server which has been carried out in the session.
The value of $P_{k-45} \| \dots \| P_k$ is generated randomly by the client, who then adds the
version number $P_{k-47}$ and $P_{k-46}$, encrypts the whole value of the premaster-secret by
the server’s public RSA key, and sends the resulting ciphertext C to the server. The
server decrypts it and creates its own copy of the premaster-secret.
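
Definitions 1 and 2 translate directly into byte-level checks. The sketch below uses hypothetical function names and treats the plaintext as a byte string of length k, so that byte $P_j$ of the definitions is `p[j - 1]` in the code.

```python
import os

def is_pkcs_conforming(p: bytes) -> bool:
    """Definition 1: 00 02, eight non-zero bytes, then a zero byte somewhere later."""
    if len(p) < 11 or p[0] != 0 or p[1] != 2:
        return False
    if any(b == 0 for b in p[2:10]):      # P_3 .. P_10 must be non-zero
        return False
    return 0 in p[10:]                    # some P_j = 0 with 11 <= j <= k

def is_s_pkcs_conforming(p: bytes) -> bool:
    """Definition 2: exactly k - 51 non-zero padding bytes, then the zero delimiter."""
    k = len(p)
    if k < 59 or not is_pkcs_conforming(p):
        return False
    if any(b == 0 for b in p[2:k - 49]):  # P_3 .. P_{k-49} must be non-zero
        return False
    return p[k - 49] == 0                 # P_{k-48} = 0

def build_plaintext(k: int, major: int, minor: int) -> bytes:
    """Client side: random non-zero padding, zero delimiter, 48-byte premaster-secret."""
    padding = bytes(b % 255 + 1 for b in os.urandom(k - 51))  # values 1..255
    premaster = bytes([major, minor]) + os.urandom(46)
    return b"\x00\x02" + padding + b"\x00" + premaster

p = build_plaintext(128, 3, 1)
```

The padding generator is slightly biased, which does not matter for this illustration.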

It is widely known that the server shall not report whether the plaintext P, $P = C^d$
mod N, is PKCS-conforming or not. In practice, a server is recommended to continue 
with a randomly chosen value of the premaster-secret if the value of P is not S- 
PKCS-conforming. Obviously, the communication breaks down soon after sending a 
Finished message, since the client and the server will both use different values for 
the session keys. However, the client (attacker) does not know whether the 
communication has broken down due to an invalid format of P or due to incorrect 
value of the premaster-secret. So, the attack is effectively defeated in this way. Of
course, the attacker still gains some information from such an interaction with the
server. She may at least try to confirm her guesses of the correct value of the
premaster-secret. However, it has been shown by Jonsson and Kaliski [4] that it is
infeasible to exploit this information for an attack.

Let us suppose that the server incorporates the above-mentioned countermeasure, 
the primary aim of which is to thwart Bleichenbacher’s attack [1]. Furthermore, let all 
S-PKCS-conforming plaintexts be processed by the server to check the validity of 
proprietary SSL/TLS extensions according to the following proposition. 

Proposition 3 (Conjectured server’s behavior). 

i) The server checks if the deciphered plaintext P is S-PKCS-conforming.
If the plaintext is not S-PKCS-conforming, the server generates a new
premaster-secret randomly, thereby breaking down the communication
soon after receiving the client’s Finished message.

ii) The server checks each S-PKCS-conforming plaintext P to see whether
$P_{k-47}$ = major and $P_{k-46}$ = minor, where major.minor is the
expected version number which is known to the attacker. For instance,
the most usual version numbers at the time of writing this paper were
3.0 and 3.1. If the test fails, the server issues a distinguishable error
message. The test is never done for plaintexts that are not
S-PKCS-conforming.

Practical tests showed that it is reasonable to assume Proposition 3 is fulfilled in 
many practical realizations of SSL/TLS servers. 

Definition 3 (Bad-Version Oracle - BVO). BVO is a mapping $BVO: \mathbb{Z}_N \to \{0, 1\}$.
BVO(C) = 1 iff $C = P^e \bmod N$, where e is the server’s public exponent, N is the
server’s modulus, and P is an S-PKCS-conforming plaintext, such that either
$P_{k-47} \neq$ major or $P_{k-46} \neq$ minor, where major.minor is the expected version number.
BVO(C) = 0 otherwise.

BVO can be easily constructed for any SSL/TLS server that acts according to 
Proposition 3. We send the ciphertext C to the server and if we receive the 
distinguished message from (ii), we set BVO(C) = 1. Otherwise, we set BVO(C) = 0. 
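
The construction of the oracle can be sketched against a simulated server. Everything here is an assumption for illustration: `ToyServer` follows the conjectured behaviour of Proposition 3, its result strings are invented, and the RSA key is built from two Mersenne primes rather than taken from any real deployment.

```python
# Toy RSA key (illustrative only).
_P, _Q = 2 ** 521 - 1, 2 ** 607 - 1
N, E = _P * _Q, 65537
D = pow(E, -1, (_P - 1) * (_Q - 1))   # Python 3.8+
K = (N.bit_length() + 7) // 8         # modulus length in bytes

def is_s_pkcs_conforming(p: bytes) -> bool:
    k = len(p)
    return (p[0] == 0 and p[1] == 2
            and all(p[2:k - 49])      # P_3 .. P_{k-49} are non-zero
            and p[k - 49] == 0)       # P_{k-48} is the zero delimiter

class ToyServer:
    """Hypothetical server acting as conjectured in Proposition 3."""

    def __init__(self, major: int = 3, minor: int = 1):
        self.version = (major, minor)

    def client_key_exchange(self, c: int) -> str:
        p = pow(c, D, N).to_bytes(K, "big")
        if not is_s_pkcs_conforming(p):
            return "handshake_failure"   # (i): random premaster-secret, fails at Finished
        if (p[K - 48], p[K - 47]) != self.version:
            return "bad_version"         # (ii): distinguishable error message
        return "handshake_failure"       # treated as indistinguishable from (i) here

def bvo(server: ToyServer, c: int) -> int:
    """BVO(C) = 1 exactly when the server emits the distinguishable message."""
    return 1 if server.client_key_exchange(c) == "bad_version" else 0

def encrypt_premaster(major: int, minor: int) -> int:
    """S-PKCS-conforming plaintext with fixed filler bytes, encrypted with (E, N)."""
    p = b"\x00\x02" + b"\x01" * (K - 51) + b"\x00" + bytes([major, minor]) + b"\x07" * 46
    return pow(int.from_bytes(p, "big"), E, N)

server = ToyServer(3, 1)
```

For example, `bvo(server, encrypt_premaster(2, 0))` yields 1, while a ciphertext carrying the expected version 3.1 yields 0.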

Theorem 1 (Usage of BVO). Let us have a BVO for given (e, N) and major.minor
and let C be an RSA ciphertext. Then BVO(C) = 1 implies that $C = P^e \bmod N$, where
P is an S-PKCS-conforming plaintext.

Proof. Follows directly from Definition 3. 

■ 

Because S-PKCS-conforming plaintext is also PKCS-conforming, it follows from 
Theorem 1 that we can use BVO to mount Bleichenbacher’s attack. We discuss the 
details in §3. Now we introduce several definitions that will be useful in the rest of 
this paper. We use a similar notation to the one used in [1]. 

Definition 4 (Probabilities concerning BVO). Let Pr(A) = B/N be the probability of
the event A that the conditions (i-ii) of Definition 1 hold for a randomly chosen
plaintext P. Let Pr(S-PKCS|A) be the conditional probability that the plaintext P is
S-PKCS-conforming assuming that A occurred for P. Let Pr(BVO|S-PKCS) be the
conditional probability that $BVO(P^e \bmod N) = 1$ assuming that P is
S-PKCS-conforming.

For Pr(A) we have $256^{-2} < \Pr(A) < 256^{-1}$ as stated in [1]. The probability
Pr(S-PKCS|A) can be expressed as $\Pr(S\text{-}PKCS \mid A) = (255/256)^{k-51} \cdot 256^{-1}$, since the
length of the non-zero padding bytes must be equal to k - 51. There is usually one value
of the version number that is expected by BVO. Therefore,
$\Pr(BVO \mid S\text{-}PKCS) = 1 - 256^{-2}$. Note
that the value of $\Pr(BVO \mid S\text{-}PKCS) \cdot \Pr(S\text{-}PKCS \mid A) \cdot \Pr(A)$ is the probability that for a
randomly chosen ciphertext C we get BVO(C) = 1.
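
These three factors can be evaluated exactly for a concrete modulus size. The sketch assumes k = 128 and uses $2^{1024} - 1$ as a stand-in for N; that number is not an RSA modulus and only fixes the magnitude of B/N.

```python
from fractions import Fraction

k = 128                    # modulus length in bytes (assumed for the example)
n = 2 ** (8 * k) - 1       # stand-in value of the right size, not a real modulus
b = 256 ** (k - 2)

pr_a = Fraction(b, n)                                            # Pr(A) = B/N
pr_s_given_a = Fraction(255, 256) ** (k - 51) * Fraction(1, 256) # Pr(S-PKCS | A)
pr_bvo_given_s = 1 - Fraction(1, 256 ** 2)                       # Pr(BVO | S-PKCS)

pr_hit = pr_bvo_given_s * pr_s_given_a * pr_a   # Pr(BVO(C) = 1) for random C
expected_queries = 1 / pr_hit                   # average random trials per hit
```

Using `Fraction` keeps the arithmetic exact; `float(pr_hit)` gives a readable approximation when needed.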



3 Attacking Premaster-Secret 

3.1 Mounting and Extending Bleichenbacher’s Attack 

This attack allows us to compute the value $x = y^d \bmod N$ for any given integer y, 
where d is an unknown RSA private exponent and N is an RSA modulus. This attack 
works under the condition that an attacker has an oracle that for any ciphertext C tells 
her whether the corresponding RSA plaintext $P = C^d \bmod N$ is PKCS-conforming or 
not. Theorem 1 shows that BVO introduced in the previous part can be used as such 
an oracle. In the case of the SSL/TLS protocols this means that we can mount this 
attack to either disclose a premaster-secret for an arbitrary captured session or to 
forge a server’s signature. In the following text, we mainly focus on the premaster- 
secret disclosure. Forging of signatures is discussed briefly in §3.4. 

The main idea here is to employ Bleichenbacher’s attack with several changes 
related to the specific properties of S-PKCS and BVO (§3.2). Furthermore, we 
employed particular optimizations, which we have tested in our sample programs, and 
which generally help an attacker (§3.3). 



3.2 S-PKCS and BVO Properties 

We show how to modify Bleichenbacher’s original RSA inversion algorithm for use 
with the BVO and to increase its efficiency. For the sake of completeness we repeat 
the necessary facts from [1] in the appendix together with a brief generalization of it. 

Recall that a PKCS-conforming plaintext P satisfies the following system of 
inequalities 

$$E \le P \le F,$$

where $E = 2B$, $F = 3B - 1$, and $B = 256^{k-2}$. The boundaries E, F are used 
extensively throughout the RSA inversion algorithm. Since BVO as well as the SSL/TLS 
protocols deal only with S-PKCS-conforming plaintexts, we may refine the 
boundaries as 

$$E' \le P \le F',$$

where the value of E’ is obtained by incorporating the minimum value of the 
padding and the value of F’ is computed with respect to the fixed position of the zero 
delimiter in the plaintext P: 

$$E' = 2B + 1 \cdot 256^{k-3} + 1 \cdot 256^{k-4} + \dots + 1 \cdot 256^{49} = 2B + 256^{49}(256^{k-51} - 1)/255$$

and 

$$F' = 2B + 255 \cdot (256^{k-3} + 256^{k-4} + \dots + 256^{49}) + 0 + 255 \cdot (256^{47} + 256^{46} + \dots + 256^{0}) = 3B - 255 \cdot 256^{48} - 1.$$

Substituting E’, F’ in place of E, F in the original algorithm (see the appendix) 
increases its effectiveness. 
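
The refined boundaries can be cross-checked numerically. The sketch below is a minimal Python illustration under the assumption that the plaintext is read as a big-endian k-byte integer with the layout 00 02, k-51 padding bytes, a zero delimiter, and 48 trailing bytes; the function names are hypothetical.

```python
def refined_bounds(k: int) -> tuple[int, int]:
    """Closed forms of E' and F' for a k-byte modulus."""
    B = 256 ** (k - 2)
    e_prime = 2 * B + 256 ** 49 * (256 ** (k - 51) - 1) // 255
    f_prime = 3 * B - 255 * 256 ** 48 - 1
    return e_prime, f_prime


def bounds_from_bytes(k: int) -> tuple[int, int]:
    """Smallest and largest byte strings with the assumed S-PKCS layout."""
    smallest = bytes([0, 2]) + b"\x01" * (k - 51) + b"\x00" + b"\x00" * 48
    largest = bytes([0, 2]) + b"\xff" * (k - 51) + b"\x00" + b"\xff" * 48
    return int.from_bytes(smallest, "big"), int.from_bytes(largest, "big")


if __name__ == "__main__":
    for k in (128, 256):
        assert refined_bounds(k) == bounds_from_bytes(k)
        B = 256 ** (k - 2)
        e_prime, f_prime = refined_bounds(k)
        # The refined interval lies strictly inside the original [2B, 3B - 1].
        assert 2 * B < e_prime < f_prime < 3 * B - 1
```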

It follows from the definition of the SSL/TLS protocol that the attacker knows the 
expected value of the version number, which is checked by BVO. Therefore, when 
attacking the ciphertext $C_0$, such that BVO($C_0$) = 0, carrying the premaster-secret, the 
attacker knows exactly the two bytes $P_{0,k-47}$ and $P_{0,k-46}$ of the S-PKCS-conforming 
plaintext $P_0 = C_0^d \bmod N$. She also knows that $P_{0,k-48} = 0$. We used this knowledge in 
our program to further trim the interval boundaries <a, b> computed in step 3 of the 
algorithm (see the appendix). 



3.3 Basic General Optimizations 

Besides the optimizations that follow directly from §3.2, we also used the generally 
applicable methods described in the following subparagraphs. 

Definition 5 (Suitable multiplier). Let us have an integer C. The integer s is said to 
be a suitable multiplier for C if it holds that $C' = s^e C \bmod N = (P')^e \bmod N$, where P’ 
is an S-PKCS-conforming plaintext. 

3.3.1 Beta Method 

The following method (β-method) follows from a generalization of the remark 
mentioned in [1], pp. 7-8. 

Lemma 1 (On linear combination). Let us have two ciphertexts $C_i$ and $C_j$, such that 
$C_i = (s_i)^e C_0 \bmod N$, $C_j = (s_j)^e C_0 \bmod N$, where $s_i$ and $s_j$ are suitable multipliers for $C_0$, 
i.e. $P_i = C_i^d \bmod N = 2B + 256^{49} PS_i + D_i$ and $P_j = C_j^d \bmod N = 2B + 256^{49} PS_j + D_j$, 
where $0 < PS_i, PS_j$ and $0 \le D_i, D_j < 256^{48}$. Then for C, $C = s^e C_0 \bmod N$ and $\beta \in Z$, 
where $s = [(1-\beta)s_i + \beta s_j] \bmod N$, it holds that $C^d \bmod N = P$, such that 
$P = [2B + 256^{49}((1-\beta)PS_i + \beta PS_j) + (1-\beta)D_i + \beta D_j] \bmod N$. 

Proof. It suffices to observe that $P = [(1-\beta)s_i + \beta s_j]P_0 \bmod N = [(1-\beta)P_i + \beta P_j] \bmod N$, 
where $P_0 = C_0^d \bmod N$. 

■ 
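
The identity used in the proof is plain modular arithmetic and can be checked on toy numbers. The following Python sketch is only an illustration: the modulus, the plaintext and the multipliers are arbitrary assumed values (not S-PKCS-conforming, and far too small for RSA), and `combine_multipliers` is a hypothetical helper name.

```python
def combine_multipliers(s_i: int, s_j: int, beta: int, n: int) -> int:
    """Return s = [(1 - beta) * s_i + beta * s_j] mod N; beta may be negative."""
    return ((1 - beta) * s_i + beta * s_j) % n


if __name__ == "__main__":
    n, p0 = 1000003, 123456          # toy modulus and toy plaintext P_0
    s_i, s_j = 17, 29                # toy multipliers
    p_i, p_j = s_i * p0 % n, s_j * p0 % n
    for beta in range(-3, 4):
        s = combine_multipliers(s_i, s_j, beta, n)
        # s * P_0 equals the same linear combination of P_i and P_j (mod N).
        assert s * p0 % n == ((1 - beta) * p_i + beta * p_j) % n
```

In an attack, each candidate s produced this way would still have to be submitted to the oracle; the lemma only predicts what plaintext it corresponds to.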

It follows from the lemma above that once we have suitable multipliers $s_i$, $s_j$ 
for a ciphertext C, we can try to search for the next suitable multiplier s as a linear 
combination of $s_i$ and $s_j$. In practice, we can try small positive and negative values of 
β and test whether the particular linear combination s gives an S-PKCS-conforming 
plaintext or not. Working in this way, we may hope to accelerate the algorithm in step 
2b (c.f. the appendix). Since we can reasonably assume that $\gcd(s_j - s_i, N) = 1$, there is 
a particular value of β for every triplet of suitable multipliers $(s_i, s_j, s)$. However, 
experiments have shown that there are also differences in how much information can 
be obtained from such s depending on the size of β. For small values of β, it has been 
observed that the obtained values of s do not reduce the size of $M_i$ as fast as the values 
of s obtained for β close to N/2. The reason is perhaps a linear dependency on Z, 
which is stronger for small β. On the other hand, β close to N/2 clearly cannot be 
directly found by "brute force" searching. More precisely, we may find such β 
directly, but we cannot assure that the obtained s will be of moderate size for further 
processing by the RSA inversion algorithm. Therefore, it remains to extract as much 
information as possible from reasonably small values of β and then to either continue 
with the incremental searching used in the original version of the algorithm [1] or to use 
the Parallel-Threads (PT) method described in §3.3.2. In advance of the following 
discussion, we note that the source for the next incremental searching or for the 
PT-method is the maximum suitable multiplier $s_i$ found, such that $s_i < N/2$. 

When using the above-mentioned method with negative values of β, we may get a 
multiplier s that is close to N (it can be regarded as a small negative value modulo N). 
Such an s cannot be directly processed, since it induces a very large interval for r in 
the original algorithm (see step 3 in the appendix). We will show how the algorithm 
can be adjusted to process small positive values of s as well as small negative values 
of s modulo N. 

Theorem 2 (On symmetry). Let us have integers s, P, and N satisfying 
$E_1 \le sP \bmod N \le F_1$, where $E_1, F_1 \in Z$. 

Then there is the integer v, $v = N - s$, satisfying 

$E_2 \le vP \bmod N \le F_2$, where $E_2 = N - F_1$, $F_2 = N - E_1$. 

Proof. We have that $vP \bmod N = (N - s)P \bmod N = (-sP) \bmod N = N - (sP \bmod N)$. 
The upper boundary of $(sP \bmod N)$ is $F_1$, therefore the lower boundary $E_2$ of 
$(vP \bmod N)$ is $E_2 = N - F_1$. Analogously, the upper boundary $F_2$ of $(vP \bmod N)$ is 
given by the lower boundary $E_1$ as $F_2 = N - E_1$. 

■ 
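
The symmetry can be observed directly on small numbers. The Python sketch below uses arbitrary toy values for N, P and the boundaries (assumptions for illustration only), and the helper name `mirror` is hypothetical; it relies on sP mod N being non-zero, which holds whenever the lower boundary is positive.

```python
def mirror(s: int, e1: int, f1: int, n: int) -> tuple[int, int, int]:
    """Map a multiplier s with bounds [E1, F1] to v = N - s with bounds [E2, F2]."""
    return n - s, n - f1, n - e1


if __name__ == "__main__":
    n, p = 1000003, 123456           # toy modulus and plaintext
    e1, f1 = 8192, 12287             # toy boundaries with 0 < E1 <= F1 < N
    for s in range(1, 20000):
        if e1 <= s * p % n <= f1:
            v, e2, f2 = mirror(s, e1, f1, n)
            # v * P mod N = N - (s * P mod N), so it falls in [N - F1, N - E1].
            assert e2 <= v * p % n <= f2
```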

We use the theorem as follows: if we get a high value of s using the β-method 
described above, then we convert it to the corresponding symmetric value v = N - s, 
which is then processed in a modified version of step 3 of the algorithm (see the 
appendix). The core of the modification is using the boundaries $E_2$, $F_2$ instead of the 
original boundaries $E_1 = E'$, $F_1 = F'$ (c.f. §3.2). 

3.3.2 Parallel-Threads (PT) Method 

Recall that the complexity of step 2 of the algorithm (see the appendix) for i > 1 
depends on the size of $M_{i-1}$. Generally, the step is expected to be much faster if 
$|M_{i-1}| = 1$ than if $|M_{i-1}| > 1$. The reason is that $|M_{i-1}| = 1$ means there is only one 
interval approximating the value of $P_0$ left and therefore certain rules can be used when 
searching for the next suitable multiplier $s_i$. Experimenting with our test program, we 
observed that even if $|M_{i-1}| > 1$, the number of intervals was usually small enough that 
it was better to start a parallel thread T for each $I \in M_{i-1}$, as if it was the only interval 
left, i.e. each interval starts its own thread in step 2c of the algorithm. These threads 
$T_1, \dots, T_w$, where $w = |M_{i-1}|$, were precisely multitasked on a per BVO call basis. They 
were arranged in the cycle $T_1, T_2, \dots, T_w, T_1, \dots$ and stepping was done in the cycle 
after each BVO call. The results obtained when thread $T_j$ found a suitable multiplier were 
projected on the whole current set of intervals for all threads. After that, the threads 
belonging to the intervals that disappeared were discarded. We observed that the 
PT-method increased the effectiveness of the original algorithm. 

Using a certain amount of heuristics we set the condition that directs whether we 
should use the PT-method or not. The PT-method is started in step i if the following 
inequality holds 

$$|M_{i-1}| < (2\varepsilon \Pr(A))^{-1} + 1.$$

The value of ε estimates the number of passes it takes from the start of the 
PT-method, started in pass i, until there is only one interval left. In our programs, 
we used ε = 2, which was the ceiling of the mean value observed for ε. 
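
The per-call multitasking of the PT-method can be pictured with a small scheduler. This is a minimal Python sketch of the round-robin idea only, not the authors' implementation: each "thread" is modelled as an iterator of candidate multipliers (how step 2c generates them is left out), and the names `round_robin_search`, `oracle` and `max_calls` are assumptions made for the example.

```python
from typing import Callable, Iterator, Optional


def round_robin_search(threads: list[Iterator[int]],
                       oracle: Callable[[int], int],
                       max_calls: int) -> Optional[tuple[int, int, int]]:
    """Step the threads cyclically, spending exactly one oracle call per step.

    Returns (thread index, suitable multiplier, oracle calls used),
    or None if the call budget or all candidates are exhausted.
    """
    active = list(enumerate(threads))
    calls = 0
    while active:
        remaining = []
        for idx, candidates in active:
            s = next(candidates, None)
            if s is None:
                continue                  # this thread ran out of candidates
            if calls >= max_calls:
                return None               # oracle call budget exhausted
            calls += 1
            if oracle(s) == 1:
                return idx, s, calls
            remaining.append((idx, candidates))
        active = remaining
    return None
```

After a hit, the caller would narrow all intervals with the found multiplier and drop the threads whose intervals disappeared, as described above.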



3.4 Note on Forging a Server’s Signature 

The BVO construction allows us to mount Bleichenbacher’s attack without any 
restrictions on its functionality. As noted above, we can compute the RSA inverse for 
any integer y, thereby obtaining the value $x = y^d \bmod N$ for the particular server’s 
private exponent d and the modulus N. Discussing the so-called semantics of the 
attack, there are only two cases in which it would be reasonable to compute this 
inversion. 

In the first case we compute the RSA inverse for a captured ciphertext carrying an 
encrypted value of the premaster-secret. This approach allows us to decrypt the whole 
communication that was carried out in a given session between a client and the server. 
This is the main approach of this paper, which we have practically tested and 
optimized. 

In the second case we compute an RSA signature of a message m on behalf of the 
server. The whole attack runs in a similar way, which means that the main activity 
between an attacker and the server is still concentrated on the phase of passing the 
premaster-secret value during the handshake procedure of the SSL/TLS protocols. 
However, this is only because we need to build up a BVO (c.f. §2) for computing the 
RSA inversion. The source of this inversion (the ciphertext C) will no longer be an 
encrypted premaster-secret itself, but the formatted value of h(m), where h is an 
appropriate hash function. Currently, the SSL/TLS protocols set h(m) = MD5(m) || 
SHA-1(m) and the value of h(m) is further formatted according to the 
EMSA-PKCS1-v1_5 method from PKCS#1 ([9], [10], [12], [13], [17]). At the end of the 
attack we obtain $C^d \bmod N$, which is the signature of our input C. Whether such a 
signature can be used for further attacks depends on the keyUsage property [18] of 
the certificate of the server’s RSA key. First, the server’s RSA key must be 
attributed for signing purposes. Second, it depends on the specific system how 
important the faked signature is, which directly implies how dangerous the attack is. 
From the basic properties of SSL/TLS ([10], [12]) it follows that such a signature may 
be abused to certify an ephemeral RSA or D-H [11] public key of a faked server. The 
faked server can then be palmed off on an ordinary user to elicit some secret information 
from her. Generally speaking, this would be an attack on the authentication of a 
server. The necessary condition here is that the user is willing to use either the 
so-called export RSA key or the ephemeral Diffie-Hellman key agreement [11]. In 
practice, some clients will and some will not; it strongly depends 
on the attention paid to the configuration of such a client. Unfortunately, these 
"minor" details are very often neglected in a huge number of applications. Moreover, 
we emphasize that the attack described here may not be the only one possible, since 
the particular importance of a server’s signature depends on the role that the server 
plays in a particular information system. The best way to avoid all these attacks is to 
not attribute the server’s RSA key for signing purposes, unless it is absolutely 
necessary. 

From the effectiveness viewpoint, we can estimate that using the RSA inversion 
based on BVO for signature forging will require more BVO calls, since we need to 
insert an extra masking zero-step (see appendix, step 1 of the algorithm). The number 
of additional BVO calls may be calculated as 
$[\Pr(\text{BVO}|\text{S-PKCS}) \cdot \Pr(\text{S-PKCS}|A) \cdot \Pr(A)]^{-1}$, 
i.e. the reciprocal of the probability that for a randomly chosen ciphertext C we 
get BVO(C) = 1. Adding this value to the number of BVO calls in the former attack 
on premaster-secret (c.f. §4) gives an estimate of the overall complexity of signature 
forging. 



4 Complexity Measurements 

Based on the analysis in [1], we can estimate the number of BVO calls for 
decrypting a ciphertext $C_0$ belonging to an S-PKCS-conforming plaintext $P_0$ as 

$$2 \cdot \Pr(P)^{-1} + (16k - 32) \cdot \Pr(P|A)^{-1},$$

where $\Pr(P|A) = \Pr(\text{BVO}|\text{S-PKCS}) \cdot \Pr(\text{S-PKCS}|A)$ and 
$\Pr(P) = \Pr(P, A) = \Pr(P|A) \cdot \Pr(A)$. Here Pr(P) is the probability that for a 
randomly chosen ciphertext C we get BVO(C) = 1. 

This estimation does not cover the optimization described in §3.2 and §3.3. 
Therefore we treat it as the worst-case estimation for a situation when these 
optimizations are not notably helping an attacker. Experiments show that the 
optimized algorithm is practically almost two times faster than this estimation (c.f. 
§4.1) for the most widely used RSA key lengths. Let us comment on the expression of 
the estimation now. 

The first additive factor corresponds with our assumption that the attacker wants to 
decipher $C_0$ belonging to a properly formatted plaintext carrying a value of the 
premaster-secret. In such a situation, she does not have to carry out initial blinding 
(c.f. the appendix, step 1). According to [1], we can estimate that she needs to find 
two suitable multipliers $s_1$, $s_2$ for $C_0$, until she can proceed with the generally faster step 
2c. This gives the first factor as $2 \cdot \Pr(P)^{-1}$. Note that, heuristically speaking, the 
optimizations (§3) mainly reduce the necessity of finding $s_2$ in the “hard” way, 
thereby decreasing the first factor close to the value $\Pr(P)^{-1}$. This hypothesis 
corresponds well with the results of our measurements. 

The second factor is a slightly modified expression presented in [1]. It corresponds 
to the number of expected BVO calls for the whole number of passes through step 2c. 
Recall that $C_0 = (P_0)^e \bmod N$, where $2B \le P_0 \le 3B - 1$, so $P_0$ lies in an interval of 
length B, $B = 256^{k-2}$. Conjecturing that each pass through step 3 roughly halves the 
length of the interval for $P_0$, we may estimate that we need $8(k - 2)$ passes. 
Furthermore, it is conjectured [1] that each pass through step 2c takes $2 \cdot \Pr(P|A)^{-1}$ 
BVO calls. From here follows the estimation of BVO calls as $(16k - 32) \cdot \Pr(P|A)^{-1}$. 
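
To get a feeling for the size of this estimate, the Python sketch below evaluates it as a function of k and Pr(A). It is only an illustration: Pr(A) = B/N depends on the concrete modulus, which is not given here, so the example merely checks that the two extreme bounds on Pr(A) from [1] bracket the 1024-bit estimate reported in Table 1. The function name is hypothetical.

```python
def estimated_bvo_calls(k: int, pr_a: float) -> float:
    """Worst-case estimate 2 / Pr(P) + (16k - 32) / Pr(P|A), without optimizations."""
    pr_p_given_a = (1 - 256.0 ** -2) * (255 / 256) ** (k - 51) / 256
    pr_p = pr_p_given_a * pr_a
    return 2 / pr_p + (16 * k - 32) / pr_p_given_a


if __name__ == "__main__":
    k = 128                                   # 1024-bit modulus
    low = estimated_bvo_calls(k, 256.0 ** -1)   # largest admissible Pr(A)
    high = estimated_bvo_calls(k, 256.0 ** -2)  # smallest admissible Pr(A)
    print(f"{low:.0f} < estimate < {high:.0f}")
    assert low < 36_591_001 < high            # Table 1 value for 1024 bits
```

The strong dependence on Pr(A) visible here is what makes moduli of 8r + 1 bits noticeably cheaper to attack than moduli of 8r bits.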

Finally, we note that the complexity of the attack is mainly determined by the 
number of necessary BVO calls. This number actually limits the attack in three 
ways. The first one is that an attacked server must bear such a number of corrupted 
Handshakes [12] (i.e. not collapse due to a log overflow, etc.). The second limitation 
comes from a total network delay that increases linearly with the number of BVO 
calls. The third limit is determined by the computational power of the server itself, 
which mainly means how fast it can carry out the RSA operation with a private key. 
Other computations during the attack are essentially faster and therefore we do not 
discuss them here. 



4.1 Simulated Local BVO 

In this paragraph, we present the measured complexity of the attack with respect to 
the total amount of BVO calls. The data of our experiment was obtained for the four 
particular randomly generated RSA moduli of 1024, 1025, 2048 and 2049 bits in 
length. For every such modulus we implemented a local simulation of BVO that we 
linked together with the optimized algorithm discussed in this paper. We then 
measured the number of BVO calls for 1200 ciphertexts of the randomly generated 
and encrypted values of the premaster-secret. 

Due to the strong dependence of the number of BVO calls on Pr(A) we see that the 
complexity of the attack is not strictly increasing with respect to the length of the 
modulus N. This discrepancy was already mentioned in [1]. It follows that one should 
use moduli with a bit length in the form 8r, where r is an integer, mainly avoiding the 
moduli with the length 8r + 1. 



Table 1. Basic statistics of the measured attack complexity in BVO calls 

Modulus   Originally estimated   Practically measured (with optimizations from §3) 
length    (without               --------------------------------------------------- 
(bits)    optimizations)         Min         Max           Median       Mean 

1024      36 591 001             815 835     278 903 416   13 331 256   20 835 297 
1025         979 488             630 589     105 122 011    1 197 380    1 422 176 
2048      48 054 328           2 824 986     354 420 492   19 908 079   28 728 801 
2049       2 794 937           1 413 005     475 298 397    3 462 557    3 896 432 



Analyzing the measured data, we observed that the distribution of the number of 
BVO calls can be approximated by a log-normal distribution, i.e. the 
logarithm of the number of BVO calls roughly follows a normal (Gaussian) 
distribution. Heuristically speaking, this means that the most basic random events 
governing the complexity of the attack primarily combine together in a multiplicative 
manner. The values of the median and mean are presented in Table 1. These values were 
obtained using the log-normal approximation of the data samples measured. These 
approximations are plotted in Fig. 1. We can see that all the distributions skew to the 
right. Therefore, the most interesting values are perhaps given by the medians. For 
example, in the case of a 1024-bit modulus, we can expect that one half of 
all attacks succeed in less than 13.34 million BVO calls. Furthermore, the data in 
Table 1 supports our conjecture that the optimizations proposed in §3 mainly speed up 
the first “hard” part of the algorithm. Therefore, this speed-up is clearly notable for 
moduli of 1024 and 2048 bits, while there is no observable effect for the moduli of 
1025 and 2049 bits. 




Fig. 1. Log-normal approximation of BVO calls density functions: in the left graph for 1024 
(the higher peak) and 2048, and in the right graph for 1025 (the higher peak) and 2049 bits long 
moduli 



4.2 Real Attack 

We successfully tested the attack on a real SSL server (AMD Athlon/1 GHz, 256 MB 
RAM) using a 1024-bit RSA key. The total number of BVO calls for decryption 
of a randomly selected premaster-secret was 2 727 042 and the whole attack took 14 
hours 22 minutes and 45 seconds. This gives an estimated speed of 52.68 BVO 
invocations per second. The server and the attacking client were locally connected via 
a 100 Mb/s Ethernet network without any other notable traffic. Given the 
conditions of this experiment, we can conclude that this is probably one of the 
best practically achievable results. Therefore, we can expect that there would be few 
practical attacks succeeding in less than 14 hours of sustained high effort (for a 
1024-bit RSA key). Using the value of the median for the 1024-bit modulus from Table 
1, we can roughly expect one half of all attacks in our setup to succeed in less than 70 
hours and 18 minutes. For a 2048-bit RSA key in the same setup we get an 
estimated speed of 11.47 BVO calls per second. Therefore, one half of all attacks 
should then succeed in less than 21 days. 
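
The timing figures above follow from simple arithmetic on the reported call counts and rates. A small Python sketch reproducing it (the helper names are hypothetical; the numbers are those quoted in the text and in Table 1):

```python
def calls_per_second(calls: int, hours: int, minutes: int, seconds: int) -> float:
    """Average oracle throughput over the given wall-clock duration."""
    return calls / (hours * 3600 + minutes * 60 + seconds)


def hours_needed(calls: int, rate: float) -> float:
    """Wall-clock hours needed for `calls` oracle invocations at `rate` calls/s."""
    return calls / rate / 3600


# Measured attack: 2 727 042 calls in 14 h 22 min 45 s.
rate_1024 = calls_per_second(2_727_042, 14, 22, 45)
# Median number of calls from Table 1 at the measured rates.
median_hours_1024 = hours_needed(13_331_256, 52.68)
median_days_2048 = hours_needed(19_908_079, 11.47) / 24

if __name__ == "__main__":
    print(f"{rate_1024:.2f} calls/s, {median_hours_1024:.2f} h, {median_days_2048:.2f} days")
```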

The experiment setup described above could be slightly improved by using a more 
powerful server. Plugging in such a server (2x Pentium III/1.4 GHz, 1 GB RAM, 100 
Mb/s Ethernet, OS RedHat 7.2, Apache 1.3.27), it was possible to achieve a speed of 
67.7 BVO calls per second for a 1024-bit RSA key. The median time for a whole 
attack on the premaster-secret could then be estimated as 54 hours and 42 minutes. 
Note that all these estimates assume achieving and sustaining high communication 
and computation throughput on the server’s side. 



4.3 Real Vulnerability 

To assess the practical impact of the attack presented here, we randomly 
generated a list of 611 public Internet SSL/TLS servers (we accepted servers 
providing SSL v. 3.0 or TLS v. 1.0) and then tested these servers to see whether it was 
possible to construct a BVO for them or not. We found that two thirds of these servers 
were vulnerable to our attack. We emphasize that this does not necessarily mean that the 
attack would always succeed on every such server. Despite the fact that all these 
servers can be regarded as broken from a purely cryptanalytic viewpoint, the 
complexity of the attack may still render it impractical in a large number of cases. We 
expect that a properly administered server (e.g. log messages are often inspected, 
suspicious clients are added to blacklists, etc.) should withstand the attack. Under 
such administration, the attack should be recognized and the attacking client would 
soon be blocked. Of course, the cryptographic strength of all these SSL/TLS 
implementations should definitely be improved. We strongly recommend applying 
appropriate patches as soon as possible. 

We observed an interesting anomaly for 110 out of the 611 tested servers. All of them 
provided both SSL v. 3.0 and TLS. 26 of them were primarily vulnerable only 
through the SSL v. 3.0 protocol, while the remaining 84 servers were primarily 
vulnerable only through the TLS protocol. We deliberately used the word "primarily", 
since if these servers share the same RSA key for both protocols, which is a very 
common practice, then an attacker can easily assault one protocol through an 
interaction with the other one. Moreover, the format of the ciphertext carrying the 
premaster-secret is the same for both protocols, so this cross-attacking actually does 
not increase the complexity of the whole attack. 



5 Technical Details 

Please refer to the extended version of this paper [19]. 



6 Countermeasures 

Due to compatibility demands, it does not seem possible to simply abandon the 
EME-PKCS1-v1_5 method and use its successor EME-OAEP. Note that even the 
EME-OAEP method must be implemented carefully (c.f. [5], [6]). On the other hand, it has 
been recently shown by Jonsson and Kaliski in [4] that EME-PKCS1-v1_5 can 
offer reasonable security (the proof was carried out for the TLS protocol) assuming 
that it is implemented properly, i.e. mainly that side channels are avoided. What 
remains is to show what a proper implementation should look like. The current 
guidelines in [12] together with [15] are obviously insufficient and should be updated 
to avoid weaknesses like the one discussed in this paper. Moreover, it seems that the 
edge between a secure and an insecure implementation of EME-PKCS1-v1_5 is very 
sharp. This implies that the standards regarding its implementation must be 
very precise. 

We propose to keep generating the premaster-secret randomly if P is not 
S-PKCS-conforming. Furthermore, we propose to replace the version bytes $P_{k-47}$ and 
$P_{k-46}$ with the expected version number in either case (i.e. if P is or is not 
S-PKCS-conforming). A more detailed description and discussion of this subject is 
provided in the extended version of the paper [19]. 
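
One possible reading of this proposal is sketched below in Python. It is an assumption-laden illustration, not the procedure from the extended version [19]: the plaintext layout (00 02, k-51 non-zero padding bytes, a zero delimiter, 48-byte premaster-secret) and the function name `extract_premaster` are assumed here, and the sketch makes no attempt at constant-time processing, which a real implementation would also have to consider.

```python
import secrets

PREMASTER_LEN = 48  # premaster-secret length in bytes, version bytes included


def extract_premaster(p: bytes, major: int, minor: int) -> bytes:
    """Return a 48-byte premaster-secret without revealing format or version errors.

    The version bytes are always overwritten with the expected major.minor;
    the remaining 46 bytes are random if the plaintext is malformed.
    """
    k = len(p)
    sep = k - PREMASTER_LEN - 1               # 0-based index of the zero delimiter
    conforming = (k > 3 + PREMASTER_LEN
                  and p[0] == 0 and p[1] == 2
                  and all(b != 0 for b in p[2:sep])
                  and p[sep] == 0)
    if conforming:
        body = p[sep + 3:]                    # 46 bytes following the version bytes
    else:
        body = secrets.token_bytes(PREMASTER_LEN - 2)
    return bytes([major, minor]) + body
```

With this behaviour a wrong version number no longer triggers a distinguishable alert; the handshake simply fails later, at the Finished message, in the same way as for any other wrong premaster-secret.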



7 Conclusions 

We have presented a new practically feasible side channel attack against the SSL/TLS 
protocols. When Bleichenbacher presented his attack on PKCS#1 (v. 1.5) in 1998 [1], 
it was generally assumed that the attack was impractical for the SSL/TLS protocols, 
since these protocols add several proprietary restrictions on the plaintext format, 
which increase the complexity of the attack. Of course, the protocols could not be 
called secure from a purely cryptographic viewpoint. Therefore, a special 
countermeasure was introduced and generally adopted [10], [12]. However, in this 
paper we have shown that problems with Bleichenbacher-like attacks on the 
SSL/TLS protocols are still not properly solved. We have identified a new possibility 
of a substantial side channel occurring during an SSL/TLS Handshake. The side 
channel originates when a receiver checks a version number value stored in the two 
left-most bytes of the premaster-secret. Based on the receiver’s behavior during this 
check, we have defined its mathematical encapsulation as a bad-version oracle (BVO, 
c.f. §2). Such a check is widely recommended for SSL/TLS servers, but unfortunately 
it is not properly specified how it should be performed. Practical tests showed that 
two thirds of randomly chosen Internet servers carried out the test wrongly, thereby 
allowing the construction of a BVO resulting in a new attack on RSA-based sessions. 
The attack itself may be viewed as an optimized and generalized variant of the 
original Bleichenbacher’s attack [1]. The most obvious target of our attack would 
probably be discovering the premaster-secret, thereby decrypting a captured 
RSA-based session. It is also possible (with an additionally increased complexity, c.f. §3.4) 
to compute the signature of an arbitrary message on behalf of the server. 

The attack was carried out in practice and its efficiency was measured (§4). The 
amount of time the attack takes in practice is mainly determined by the number of 
BVO calls. Each BVO call corresponds to one attempt to establish an SSL/TLS 
connection with an attacked server. If the server uses a typical 1024-bit RSA 
key, then we can expect that roughly 50% of attacks succeed in less than 13.34 
million BVO calls. This load may be further spread in time and even distributed to 
many computers. The main aim would not be speeding up the attack, but making its 
localization and blocking harder. Although the complexity presented here is definitely 
very low from a purely cryptographic viewpoint, there may still be technical measures 
that can thwart the attack in practice. For instance, each BVO call should produce at 
least one log record on the server’s side. If these logs are well maintained and 
appropriately inspected, then the attack should be recognized in time. Unfortunately, 
there also seem to be poorly administered servers where SSL/TLS audit messages are 
almost ignored. These servers remain protected solely by their network and 
computational throughput, which is obviously alarming. 
Appendix 

For the sake of completeness we enclose here the algorithm from [1]. For our 
purposes we define directly a slight generalization and modification of it. Recall that 
in the original text $E = 2B$, $F = 3B - 1$, where $B = 256^{k-2}$. In our variant, we will use the 
refined values E’ and F’ (c.f. §3). According to Definition 1 and the original notation 
used below, we note that a ciphertext C is said to be PKCS conforming iff 
$C = P^e \bmod N$, where P is a PKCS-conforming plaintext. The modified algorithm is as follows. 

Step 1: Blinding. Given an integer c, choose different random integers $s_0$, then check, 
by accessing the oracle, whether $c(s_0)^e \bmod N$ is PKCS conforming. 

For the first successful $s_0$ set 

$c_0 \leftarrow c(s_0)^e \bmod N$ 
$M_0 \leftarrow \{[E, F]\}$ 
$i \leftarrow 1$. 

Step 2: Searching for PKCS conforming messages. 

Step 2.a: Starting the search. If i = 1, then search for the smallest positive 
integer $s_1 \ge N/(F+1)$, such that the ciphertext $c_0(s_1)^e \bmod N$ is PKCS 
conforming. 

Step 2.b: Searching with more than one interval left. Otherwise, if i > 1 and 
the number of intervals in $M_{i-1}$ is at least 2, then search for the smallest 
integer $s_i > s_{i-1}$ such that the ciphertext $c_0(s_i)^e \bmod N$ is PKCS conforming. 

Step 2.c: Searching with one interval left. Otherwise, if $M_{i-1}$ contains exactly 
one interval (i.e. $M_{i-1} = \{[a, b]\}$), then choose small integer values $r_i$, $s_i$ such 
that 

$$r_i \ge \lceil 2(b s_{i-1} - E)/N \rceil$$

and 

$$(E + r_i N)/b \le s_i \le (F + r_i N)/a$$

until the ciphertext $c_0(s_i)^e \bmod N$ is PKCS conforming. 

Step 3: Narrowing the set of solutions. After $s_i$ has been found, the set $M_i$ is 
computed as 

$$M_i \leftarrow \bigcup_{(a,b,r)} \{ [\max(a, \lceil (E + rN)/s_i \rceil), \min(b, \lfloor (F + rN)/s_i \rfloor)] \}$$

for all $[a, b] \in M_{i-1}$ and $(a s_i - F)/N \le r \le (b s_i - E)/N$. 

Step 4: Computing the solution. If $M_i$ contains only one interval of length 1 (i.e., 
$M_i = \{[a, a]\}$), then set $m \leftarrow a(s_0)^{-1} \bmod N$, and return m as the solution of 
$m \equiv c^d \pmod N$. Otherwise, set $i \leftarrow i + 1$ and go to step 2. 
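
The interval narrowing of step 3 is the part of the algorithm that is easiest to get wrong, so a small executable sketch may help. The Python code below is an illustration under stated assumptions, not the authors' program: it takes the interval bounds as ⌈(E + rN)/s⌉ and ⌊(F + rN)/s⌋, does not merge overlapping intervals, and is exercised only on toy numbers where conformance is checked directly on s·P mod N instead of through an oracle. The names `ceil_div` and `narrow` are hypothetical.

```python
def ceil_div(x: int, y: int) -> int:
    """Ceiling of x / y for integers with y > 0, without floating point."""
    return -((-x) // y)


def narrow(intervals: list[tuple[int, int]], s: int,
           e: int, f: int, n: int) -> list[tuple[int, int]]:
    """Step 3: keep the parts of each [a, b] compatible with E <= s*P mod N <= F."""
    result = []
    for a, b in intervals:
        r_min = ceil_div(a * s - f, n)
        r_max = (b * s - e) // n
        for r in range(r_min, r_max + 1):
            lo = max(a, ceil_div(e + r * n, s))
            hi = min(b, (f + r * n) // s)
            if lo <= hi:
                result.append((lo, hi))
    return result


if __name__ == "__main__":
    n, e, f = 1000003, 8192, 12287     # toy modulus and boundaries (E = 2B, F = 3B - 1)
    p = 10000                          # toy "unknown" plaintext with E <= P <= F
    m = [(e, f)]
    # Find the first multiplier s >= 2 for which s*P mod N falls into [E, F].
    s = next(t for t in range(2, n) if e <= t * p % n <= f)
    m = narrow(m, s, e, f, n)
    assert any(lo <= p <= hi for lo, hi in m)   # the true plaintext survives
```

Repeating the search and narrowing with further multipliers shrinks the set until a single value remains, which is what step 4 tests for.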